Skip to content

API Reference

document_to_podcast.preprocessing.data_loaders

document_to_podcast.preprocessing.data_cleaners

clean_html(text)

Clean HTML text.

This function removes
  • scripts
  • styles
  • links
  • meta tags

In addition, it calls clean_with_regex.

Examples:

>>> clean_html("<html><body><p>Hello,  world!  </p></body></html>"")
"Hello, world!"

Parameters:

Name Type Description Default
text str

The HTML text to clean.

required

Returns:

Name Type Description
str str

The cleaned text.

Source code in src/document_to_podcast/preprocessing/data_cleaners.py
def clean_html(text: str) -> str:
    """Clean HTML text.

    This function removes:
        - scripts
        - styles
        - links
        - meta tags

    In addition, it calls [clean_with_regex][document_to_podcast.preprocessing.data_cleaners.clean_with_regex].

    Examples:
        >>> clean_html("<html><body><p>Hello,  world!  </p></body></html>"")
        "Hello, world!"

    Args:
        text (str): The HTML text to clean.

    Returns:
        str: The cleaned text.
    """
    soup = BeautifulSoup(text, "html.parser")
    for tag in soup(["script", "style", "link", "meta"]):
        tag.decompose()
    text = soup.get_text()
    return clean_with_regex(text)

clean_markdown(text)

Clean Markdown text.

This function removes
  • markdown images

In addition, it calls clean_with_regex.

Examples:

>>> clean_markdown('# Title   with image ![alt text](image.jpg "Image Title")')
"Title with image"

Parameters:

Name Type Description Default
text str

The Markdown text to clean.

required

Returns:

Name Type Description
str str

The cleaned text.

Source code in src/document_to_podcast/preprocessing/data_cleaners.py
def clean_markdown(text: str) -> str:
    """Clean Markdown text.

    This function removes:
        - markdown images

    In addition, it calls [clean_with_regex][document_to_podcast.preprocessing.data_cleaners.clean_with_regex].

    Examples:
        >>> clean_markdown('# Title   with image ![alt text](image.jpg "Image Title")')
        "Title with image"

    Args:
        text (str): The Markdown text to clean.

    Returns:
        str: The cleaned text.
    """
    text = re.sub(r'!\[.*?\]\(.*?(".*?")?\)', "", text)

    return clean_with_regex(text)

clean_with_regex(text)

Clean text using regular expressions.

This function removes
  • URLs
  • emails
  • special characters
  • extra spaces

Examples:

>>> clean_with_regex("ย Hello,   world! http://example.com")
"Hello, world!"

Parameters:

Name Type Description Default
text str

The text to clean.

required

Returns:

Name Type Description
str str

The cleaned text.

Source code in src/document_to_podcast/preprocessing/data_cleaners.py
def clean_with_regex(text: str) -> str:
    """
    Clean text using regular expressions.

    This function removes:
        - URLs
        - emails
        - special characters
        - extra spaces

    Examples:
        >>> clean_with_regex("\xa0Hello,   world! http://example.com")
        "Hello, world!"

    Args:
        text (str): The text to clean.

    Returns:
        str: The cleaned text.
    """
    text = re.sub(
        r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
        "",
        text,
    )
    text = re.sub(r"[\w\.-]+@[\w\.-]+\.[\w]+", "", text)
    text = re.sub(r'[^a-zA-Z0-9\s.,!?;:"\']', "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

document_to_podcast.inference.model_loaders

TTSModel dataclass

The purpose of this class is to provide a unified interface for all the TTS models supported. Specifically, different TTS model families have different peculiarities, for example, the bark models need a BarkProcessor, the parler models need their own tokenizer, etc. This wrapper takes care of this complexity so that the user doesn't have to deal with it.

Parameters:

Name Type Description Default
model KPipeline

A TTS model that has a .generate() method or similar that takes text as input, and returns an audio in the form of a numpy array.

required
model_id str

The model's identifier string.

required
sample_rate int

The sample rate of the audio, required for properly saving the audio to a file.

required
custom_args dict

Any model-specific arguments that a TTS model might require, e.g. tokenizer.

required
Source code in src/document_to_podcast/inference/model_loaders.py
@dataclass
class TTSModel:
    """
    The purpose of this class is to provide a unified interface for all the TTS models supported.
    Specifically, different TTS model families have different peculiarities, for example, the bark models need a
    BarkProcessor, the parler models need their own tokenizer, etc. This wrapper takes care of this complexity so that
    the user doesn't have to deal with it.

    Args:
        model: A TTS model that has a .generate() method or similar
            that takes text as input, and returns an audio in the form of a numpy array.
        model_id (str): The model's identifier string.
        sample_rate (int): The sample rate of the audio, required for properly saving the audio to a file.
        custom_args (dict): Any model-specific arguments that a TTS model might require, e.g. tokenizer.
    """

    model: KPipeline
    model_id: str
    sample_rate: int
    custom_args: field(default_factory=dict)

_load_kokoro_tts(model_id, **kwargs)

Loads the kokoro model using the KPipeline from the package https://github.com/hexgrad/kokoro

Parameters:

Name Type Description Default
model_id str

Identifier for a specific model. Kokoro currently only supports one model.

required
kwargs str

Needs to include 'lang_code' necessary to set the language used for generation. For example: ๐Ÿ‡ช๐Ÿ‡ธ 'e' => Spanish es ๐Ÿ‡ซ๐Ÿ‡ท 'f' => French fr-fr ๐Ÿ‡ฎ๐Ÿ‡ณ 'h' => Hindi hi ๐Ÿ‡ฎ๐Ÿ‡น 'i' => Italian it ๐Ÿ‡ง๐Ÿ‡ท 'p' => Brazilian Portuguese pt-br ๐Ÿ‡บ๐Ÿ‡ธ 'a' => American English ๐Ÿ‡ฌ๐Ÿ‡ง 'b' => British English ๐Ÿ‡ฏ๐Ÿ‡ต 'j' => Japanese: you will need to also pip install misaki[ja] ๐Ÿ‡จ๐Ÿ‡ณ 'z' => Mandarin Chinese: you will need to also pip install misaki[zh]

{}

Returns: TTSModel: The loaded model using the TTSModel wrapper.

Source code in src/document_to_podcast/inference/model_loaders.py
def _load_kokoro_tts(model_id: str, **kwargs) -> TTSModel:
    """
    Loads the kokoro model using the KPipeline from the package https://github.com/hexgrad/kokoro

    Args:
        model_id (str): Identifier for a specific model. Kokoro currently only supports one model.
        kwargs (str): Needs to include 'lang_code' necessary to set the language used for generation. For example:
            ๐Ÿ‡ช๐Ÿ‡ธ 'e' => Spanish es
            ๐Ÿ‡ซ๐Ÿ‡ท 'f' => French fr-fr
            ๐Ÿ‡ฎ๐Ÿ‡ณ 'h' => Hindi hi
            ๐Ÿ‡ฎ๐Ÿ‡น 'i' => Italian it
            ๐Ÿ‡ง๐Ÿ‡ท 'p' => Brazilian Portuguese pt-br
            ๐Ÿ‡บ๐Ÿ‡ธ 'a' => American English
            ๐Ÿ‡ฌ๐Ÿ‡ง 'b' => British English
            ๐Ÿ‡ฏ๐Ÿ‡ต 'j' => Japanese: you will need to also pip install misaki[ja]
            ๐Ÿ‡จ๐Ÿ‡ณ 'z' => Mandarin Chinese: you will need to also pip install misaki[zh]
    Returns:
        TTSModel: The loaded model using the TTSModel wrapper.
    """
    from kokoro import KPipeline

    # If language code not supplied, assume British English
    pipeline = KPipeline(lang_code=kwargs.pop("lang_code", "b"))
    return TTSModel(
        model=pipeline,
        model_id=model_id,
        sample_rate=24000,  # Kokoro's default sample rate
        custom_args={},
    )

load_llama_cpp_model(model_id)

Loads the given model_id using Llama.from_pretrained.

Examples:

>>> model = load_llama_cpp_model("bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf")

Parameters:

Name Type Description Default
model_id str

The model id to load. Format is expected to be {org}/{repo}/{filename}.

required

Returns:

Name Type Description
Llama Llama

The loaded model.

Source code in src/document_to_podcast/inference/model_loaders.py
def load_llama_cpp_model(model_id: str) -> Llama:
    """
    Loads the given model_id using Llama.from_pretrained.

    Examples:
        >>> model = load_llama_cpp_model("bartowski/Qwen2.5-7B-Instruct-GGUF/Qwen2.5-7B-Instruct-Q8_0.gguf")

    Args:
        model_id (str): The model id to load.
            Format is expected to be `{org}/{repo}/{filename}`.

    Returns:
        Llama: The loaded model.
    """
    org, repo, filename = model_id.split("/")
    model = Llama.from_pretrained(
        repo_id=f"{org}/{repo}",
        filename=filename,
        n_ctx=0,  # 0 means that the model limit will be used, instead of the default (512) or other hardcoded value
        verbose=False,
        n_gpu_layers=-1 if torch.cuda.is_available() else 0,
    )
    return model

document_to_podcast.inference.model_loaders.TTS_LOADERS = {'hexgrad/Kokoro-82M': _load_kokoro_tts} module-attribute

document_to_podcast.inference.text_to_text

text_to_text(input_text, model, system_prompt, return_json=True, stop=None)

Transforms input_text using the given model and system prompt.

Parameters:

Name Type Description Default
input_text str

The text to be transformed.

required
model Llama

The model to use for conversion.

required
system_prompt str

The system prompt to use for conversion.

required
return_json bool

Whether to return the response as JSON. Defaults to True.

True
stop str | list[str] | None

The stop token(s).

None

Returns:

Name Type Description
str str

The full transformed text.

Source code in src/document_to_podcast/inference/text_to_text.py
def text_to_text(
    input_text: str,
    model: Llama,
    system_prompt: str,
    return_json: bool = True,
    stop: str | list[str] | None = None,
) -> str:
    """
    Transforms input_text using the given model and system prompt.

    Args:
        input_text (str): The text to be transformed.
        model (Llama): The model to use for conversion.
        system_prompt (str): The system prompt to use for conversion.
        return_json (bool, optional): Whether to return the response as JSON.
            Defaults to True.
        stop (str | list[str] | None, optional): The stop token(s).

    Returns:
        str: The full transformed text.
    """
    response = chat_completion(
        input_text, model, system_prompt, return_json, stop=stop, stream=False
    )
    return response["choices"][0]["message"]["content"]

text_to_text_stream(input_text, model, system_prompt, return_json=True, stop=None)

Transforms input_text using the given model and system prompt.

Parameters:

Name Type Description Default
input_text str

The text to be transformed.

required
model Llama

The model to use for conversion.

required
system_prompt str

The system prompt to use for conversion.

required
return_json bool

Whether to return the response as JSON. Defaults to True.

True
stop str | list[str] | None

The stop token(s).

None

Yields:

Name Type Description
str str

Chunks of the transformed text as they are available.

Source code in src/document_to_podcast/inference/text_to_text.py
def text_to_text_stream(
    input_text: str,
    model: Llama,
    system_prompt: str,
    return_json: bool = True,
    stop: str | list[str] | None = None,
) -> Iterator[str]:
    """
    Transforms input_text using the given model and system prompt.

    Args:
        input_text (str): The text to be transformed.
        model (Llama): The model to use for conversion.
        system_prompt (str): The system prompt to use for conversion.
        return_json (bool, optional): Whether to return the response as JSON.
            Defaults to True.
        stop (str | list[str] | None, optional): The stop token(s).

    Yields:
        str: Chunks of the transformed text as they are available.
    """
    response = chat_completion(
        input_text, model, system_prompt, return_json, stop=stop, stream=True
    )
    for item in response:
        if item["choices"][0].get("delta", {}).get("content", None):
            yield item["choices"][0].get("delta", {}).get("content", None)

document_to_podcast.inference.text_to_speech

_text_to_speech_kokoro(input_text, model, voice_profile)

TTS generation function for the Kokoro model Args: input_text (str): The text to convert to speech. model (KPipeline): The kokoro pipeline as defined in https://github.com/hexgrad/kokoro voice_profile (str) : a pre-defined ID for the Kokoro models (e.g. "af_bella") more info here https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md

Returns:

Type Description
ndarray

numpy array: The waveform of the speech as a 2D numpy array

Source code in src/document_to_podcast/inference/text_to_speech.py
def _text_to_speech_kokoro(
    input_text: str, model: KPipeline, voice_profile: str
) -> np.ndarray:
    """
    TTS generation function for the Kokoro model
    Args:
        input_text (str): The text to convert to speech.
        model (KPipeline): The kokoro pipeline as defined in https://github.com/hexgrad/kokoro
        voice_profile (str) : a pre-defined ID for the Kokoro models (e.g. "af_bella")
            more info here https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md

    Returns:
        numpy array: The waveform of the speech as a 2D numpy array
    """
    generator = model(input_text, voice=voice_profile)

    _, _, audio = next(generator)  # returns graphemes/text, phonemes, audio

    return np.array(audio)

text_to_speech(input_text, model, voice_profile)

Generate speech from text using a TTS model.

Parameters:

Name Type Description Default
input_text str

The text to convert to speech.

required
model TTSModel

The TTS model to use.

required
voice_profile str

The voice profile to use for the speech. The format depends on the TTSModel used.

required

Returns: np.ndarray: The waveform of the speech as a 2D numpy array

Source code in src/document_to_podcast/inference/text_to_speech.py
def text_to_speech(input_text: str, model: TTSModel, voice_profile: str) -> np.ndarray:
    """
    Generate speech from text using a TTS model.

    Args:
        input_text (str): The text to convert to speech.
        model (TTSModel): The TTS model to use.
        voice_profile (str): The voice profile to use for the speech. The format depends on the TTSModel used.
    Returns:
        np.ndarray: The waveform of the speech as a 2D numpy array
    """
    return TTS_INFERENCE[model.model_id](
        input_text, model.model, voice_profile, **model.custom_args
    )

document_to_podcast.inference.text_to_speech.TTS_INFERENCE = {'hexgrad/Kokoro-82M': _text_to_speech_kokoro} module-attribute