Text-to-Speech with Google AI

Python

Text-to-Speech AI

Creating an audio book using Google’s Text-to-Speech API

Author

Christoph Scheuch

Published

February 24, 2025

Lately, I’ve found myself increasingly drawn to audiobooks and became curious about how I could create one myself using AI. In this post, I’ll walk you through how to leverage Google’s powerful Text-to-Speech (TTS) API to transform written content into high-quality audio. As an example, we’ll use the first chapter of Franz Kafka’s The Metamorphosis and turn this classic piece of literature into an audiobook - step by step.

Before we dive into generating our audiobook, let’s set up the Python environment with the necessary libraries. We’ll use Google Cloud’s Text-to-Speech client to convert text into speech, along with some utility libraries for handling environment variables and audio processing.

The key packages we’ll need are:

google-cloud-texttospeech: Google’s official client library for the Text-to-Speech API.
pydub: A simple and easy-to-use library for audio manipulation.
python-dotenv: To securely load API keys and configuration from a .env file.

import os
import time
import re

from dotenv import load_dotenv
from google.cloud import texttospeech
from pydub import AudioSegment

load_dotenv()

Prepare Text

To begin, we need the text we want to convert into audio. I downloaded the first chapter of Franz Kafka’s The Metamorphosis from Project Gutenberg, which offers a vast collection of public domain books.

with open("metamorphosis_chapter1.txt", "r", encoding="utf-8") as file:
    text = file.read()

text[:500]

'One morning, when Gregor Samsa woke from troubled dreams, he found\nhimself transformed in his bed into a horrible vermin. He lay on his\narmour-like back, and if he lifted his head a little he could see his\nbrown belly, slightly domed and divided by arches into stiff sections.\nThe bedding was hardly able to cover it and seemed ready to slide off\nany moment. His many legs, pitifully thin compared with the size of the\nrest of him, waved about helplessly as he looked.\n\n“What’s happened to me?” he th'

Before moving forward, skim through the loaded text to ensure there aren’t any unwanted headers, footers, or formatting issues (like excessive line breaks) that might affect the audio quality or API compatibility.

One of the challenges in working with text-to-speech APIs is handling large chunks of text. Google’s API has limitations on input size, so we need to split our text intelligently. The following approach breaks the text into manageable paragraphs while preserving the natural flow of the narrative. It splits text on double line breaks (paragraphs), ensures each chunk stays within a 2000-byte limit, and maintains paragraph integrity where possible.

def split_text_by_paragraphs(
    text: str, 
    max_bytes: int = 2000
) -> list[str]:
    """
    Split the text into chunks based on paragraphs (empty lines) and ensure each chunk is within the byte limit.

    Args:
        text (str): The input text to split.
        max_bytes (int): Maximum byte size for each chunk.

    Returns:
        list[str]: List of text chunks.
    """
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = ""
    current_bytes = 0

    for paragraph in paragraphs:
        paragraph_bytes = len(paragraph.encode("utf-8"))
        if current_bytes + paragraph_bytes + 1 > max_bytes:
            chunks.append(current_chunk.strip())
            current_chunk = paragraph
            current_bytes = paragraph_bytes
        else:
            if current_chunk:
                current_chunk += "\n\n" + paragraph
            else:
                current_chunk = paragraph
            current_bytes += paragraph_bytes + 2 

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

Let’s create paragraphs and inspect the first one:

paragraphs = split_text_by_paragraphs(text)
len(paragraphs)
paragraphs[0]

'One morning, when Gregor Samsa woke from troubled dreams, he found\nhimself transformed in his bed into a horrible vermin. He lay on his\narmour-like back, and if he lifted his head a little he could see his\nbrown belly, slightly domed and divided by arches into stiff sections.\nThe bedding was hardly able to cover it and seemed ready to slide off\nany moment. His many legs, pitifully thin compared with the size of the\nrest of him, waved about helplessly as he looked.\n\n“What’s happened to me?” he thought. It wasn’t a dream. His room, a\nproper human room although a little too small, lay peacefully between\nits four familiar walls. A collection of textile samples lay spread out\non the table—Samsa was a travelling salesman—and above it there hung a\npicture that he had recently cut out of an illustrated magazine and\nhoused in a nice, gilded frame. It showed a lady fitted out with a fur\nhat and fur boa who sat upright, raising a heavy fur muff that covered\nthe whole of her lower arm towards the viewer.\n\nGregor then turned to look out the window at the dull weather. Drops of\nrain could be heard hitting the pane, which made him feel quite sad.\n“How about if I sleep a little bit longer and forget all this\nnonsense”, he thought, but that was something he was unable to do\nbecause he was used to sleeping on his right, and in his present state\ncouldn’t get into that position. However hard he threw himself onto his\nright, he always rolled back to where he was. He must have tried it a\nhundred times, shut his eyes so that he wouldn’t have to look at the\nfloundering legs, and only stopped when he began to feel a mild, dull\npain there that he had never felt before.'

Raw text often contains formatting that doesn’t translate well to speech or even triggers error codes in the API. The following clean_chunk() function emerged in another project that I was working on and prepares the text by: converting single line breaks to spaces, removing double periods, cleaning up special characters and quotation marks, eliminating parenthetical content, and handling Unicode and control characters. This cleaning process is crucial for producing natural-sounding speech without awkward pauses or artifacts. Note that the function below is not generally applicable and needs to be adapted to your specific context.

def clean_chunk(chunk) -> str: 
    """
    Cleans and formats a text chunk by removing unwanted characters, normalizing whitespace, and improving readability.

    Args:
        chunk (str): The text chunk to be cleaned.

    Returns:
        str: The cleaned and formatted text.
    """
    cleaned_chunk = re.sub(r'(?<!\n)\n(?!\n)', ' ', chunk) 
    cleaned_chunk = re.sub(r'\n{2,}', '. ', cleaned_chunk)
    cleaned_chunk = cleaned_chunk.replace("..", ".").replace("»", "").replace("«", "")
    cleaned_chunk = re.sub(r'\s-\s+', '', cleaned_chunk)
    cleaned_chunk = re.sub(r'\([^)]*\)', '', cleaned_chunk).strip()
    cleaned_chunk = cleaned_chunk.replace("\u2028", " ")
    cleaned_chunk = re.sub(r'[\x00-\x1F\x7F-\x9F]', ' ', cleaned_chunk)

    return cleaned_chunk

clean_chunk(paragraphs[0])

'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked. “What’s happened to me?” he thought. It wasn’t a dream. His room, a proper human room although a little too small, lay peacefully between its four familiar walls. A collection of textile samples lay spread out on the table—Samsa was a travelling salesman—and above it there hung a picture that he had recently cut out of an illustrated magazine and housed in a nice, gilded frame. It showed a lady fitted out with a fur hat and fur boa who sat upright, raising a heavy fur muff that covered the whole of her lower arm towards the viewer. Gregor then turned to look out the window at the dull weather. Drops of rain could be heard hitting the pane, which made him feel quite sad. “How about if I sleep a little bit longer and forget all this nonsense”, he thought, but that was something he was unable to do because he was used to sleeping on his right, and in his present state couldn’t get into that position. However hard he threw himself onto his right, he always rolled back to where he was. He must have tried it a hundred times, shut his eyes so that he wouldn’t have to look at the floundering legs, and only stopped when he began to feel a mild, dull pain there that he had never felt before.'

Convert Text to Speech

The heart of our solution lies in the text_to_speech() function, which interfaces with Google’s API. I’ve configured it with specific parameters to create a more engaging listening experience: aAdjusting pitch (-20) for a more natural sound and setting to a comfortable speaking rate (0.8). The function includes error handling and retry logic, making it robust enough for processing longer texts like books.

def text_to_speech(
    text: str, 
    output_file: str, 
    model: str = "en-US-Studio-Q",
    pitch: float = -20,
    speaking_rate: float = 0.8,
    max_retries: int = 5, 
    base_delay: float = 1.0
):
    """
    Convert text to speech and save the output as an MP3 file, with exponential backoff for retries.
    
    Args:
        text (str): The text to convert to speech.
        output_file (str): The path to save the output MP3 file.
        model (str): The model used.
        pitch (float): The pitch parameter of the model.
        speaking_rate (float): The speaking_rate parameter of the model. 
        max_retries (int): Maximum number of retries on failure.
        base_delay (float): Base delay in seconds for exponential backoff.
    """
    client = texttospeech.TextToSpeechClient()

    synthesis_input = texttospeech.SynthesisInput(text=text)

    voice = texttospeech.VoiceSelectionParams(
        language_code=model[:5],
        name=model
    )

    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        pitch=pitch,
        speaking_rate=speaking_rate
    )

    retries = 0
    while retries < max_retries:
        try:
            response = client.synthesize_speech(
                input=synthesis_input,
                voice=voice,
                audio_config=audio_config
            )
            with open(output_file, "wb") as out:
                out.write(response.audio_content)
                print(f"Audio content written to file: {output_file}")
            return
        except Exception as e:
            if hasattr(e, 'code') and e.code == 500:
                retries += 1
                delay = base_delay * (2 ** (retries - 1))
                print(f"Error 500: Retrying in {delay:.2f} seconds... (Attempt {retries}/{max_retries})")
                time.sleep(delay)
            else:
                print(f"Non-retryable error: {e}")
                raise

    print(f"Failed to process text after {max_retries} retries.")
    raise RuntimeError("Max retries reached.")

To create an audio of the first paragraph of the example chapter and store it locally, you just run:

text_to_speech(paragraphs[0], "out/part1.mp3")

Process Text

Now that we have the text split into manageable chunks, cleaned them for better text-to-speech conversion, and created a function to interface with Google’s API, it’s time to process the parapgrahs and generate MP3 files. The process_text() function puts the pieces from above together and stores MP3 files for each paragraph separately.

def process_text(text: list, output_folder: str):
    """
    Process a text, split it into chunks, and generate MP3 files in the output folder.

    Args:
        text (str): A list of file paths to text files.
        output_folder (str): The folder to save the generated MP3 files.
    """
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    text_chunks = split_text_by_paragraphs(text)

    for i, chunk in enumerate(text_chunks):
        output_file_name = f"part{i+1}.mp3"
        output_file_path = os.path.join(output_folder, output_file_name)
                
        cleaned_chunk = clean_chunk(chunk)

        text_to_speech(cleaned_chunk, output_file_path)
        time.sleep(1)

To create audio files for each paragraph separately, just run:

process_text(text, "out")

Combine Individual Segments

After converting individual text chunks to speech, we need to stitch them together into a cohesive audiobook. This final step uses pydub’s AudioSegment to combine the individual MP3 files seamlessly, ensuring smooth transitions between segments. Note that you might have to run pip install audioop-lts to make this work.

input_dir = "out"
output_dir = "out"

def get_number(filename):
    return int(filename.replace('part', '').replace('.mp3', ''))

mp3_files = sorted(
    [file for file in os.listdir(input_dir) if file.endswith(".mp3")],
    key=get_number
)

combined_audio = None
for file in mp3_files:
    audio = AudioSegment.from_file(os.path.join(input_dir, file))
    combined_audio = audio if combined_audio is None else combined_audio + audio
combined_audio.export("out/chapter1.mp3", format="mp3", bitrate="320k")

You can listen to the first chapter of Kafka’s The Metamorphosis generated with the Google TTS API here:

Creating audiobooks with Google’s Text-to-Speech API is surprisingly straightforward. While the output may not match the nuanced performance of human narrators, it provides a quick and effective way to convert text content into listenable audio. This approach is particularly valuable for, e.g., rapid prototyping of audio content, or creating accessible versions of text materials. When using this system, keep the API costs and rate limits, as well as the importance of proper text processing in mind.