Docs
CodeRabbit
Cloudflare
AG Grid
SerpAPI
Netlify
OpenRouter
Neon
WorkOS
Clerk
Convex
Electric
PowerSync
Sentry
Railway
Prisma
Strapi
Unkey
CodeRabbit
Cloudflare
AG Grid
SerpAPI
Netlify
OpenRouter
Neon
WorkOS
Clerk
Convex
Electric
PowerSync
Sentry
Railway
Prisma
Strapi
Unkey
Class References
Function References
Interface References
Type Alias References
Variable References
Guides

Text-to-Speech

Text-to-Speech (TTS)

TanStack AI provides support for text-to-speech generation through dedicated TTS adapters. This guide covers how to convert text into spoken audio using OpenAI and Gemini providers.

Overview

Text-to-speech (TTS) is handled by TTS adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI. The TTS adapters support:

  • OpenAI: TTS-1, TTS-1-HD, and audio-capable GPT-4o models
  • Gemini: Gemini 2.5 Flash TTS (experimental)

Basic Usage

OpenAI Text-to-Speech

typescript
import { generateSpeech } from '@tanstack/ai'
import { openaiTTS } from '@tanstack/ai-openai'

// Create a TTS adapter (uses OPENAI_API_KEY from environment)
const adapter = openaiSpeech()

// Generate speech from text
const result = await generateSpeech({
  adapter: openaiTTS('tts-1'),
  text: 'Hello, welcome to TanStack AI!',
  voice: 'alloy',
})

// result.audio contains base64-encoded audio data
console.log(result.format) // 'mp3'
console.log(result.contentType) // 'audio/mpeg'

Gemini Text-to-Speech (Experimental)

typescript
import { generateSpeech } from '@tanstack/ai'
import { geminiSpeech } from '@tanstack/ai-gemini'

// Create a TTS adapter (uses GOOGLE_API_KEY from environment)
const adapter = geminiSpeech()

// Generate speech from text
const result = await generateSpeech({
  adapter: geminiTTS('gemini-2.5-flash-preview-tts'),
  text: 'Hello from Gemini TTS!',
})

console.log(result.audio) // Base64 encoded audio

Options

Common Options

All TTS adapters support these common options:

OptionTypeDescription
textstringThe text to convert to speech (required)
voicestringThe voice to use for generation
formatstringOutput audio format (e.g., "mp3", "wav")

OpenAI Voice Options

OpenAI provides several distinct voices:

VoiceDescription
alloyNeutral, balanced voice
echoWarm, conversational voice
fableExpressive, storytelling voice
onyxDeep, authoritative voice
novaFriendly, upbeat voice
shimmerClear, gentle voice
ashCalm, measured voice
balladMelodic, flowing voice
coralBright, energetic voice
sageWise, thoughtful voice
versePoetic, rhythmic voice

OpenAI Format Options

FormatDescription
mp3MP3 audio (default)
opusOpus audio (good for streaming)
aacAAC audio
flacFLAC audio (lossless)
wavWAV audio (uncompressed)
pcmRaw PCM audio

Model Options

OpenAI Model Options

typescript
const result = await generateSpeech({
  adapter: openaiTTS('tts-1-hd'),
  text: 'High quality speech synthesis',
  voice: 'nova',
  format: 'mp3',
  modelOptions: {
    speed: 1.0, // 0.25 to 4.0
  },
})
OptionTypeDescription
speednumberPlayback speed (0.25 to 4.0, default 1.0)
instructionsstringVoice style instructions (GPT-4o audio models only)

Note: The instructions and stream_format options are only available with gpt-4o-audio-preview and gpt-4o-mini-audio-preview models, not with tts-1 or tts-1-hd.

Response Format

The TTS result includes:

typescript
interface TTSResult {
  id: string        // Unique identifier for this generation
  model: string     // The model used
  audio: string     // Base64-encoded audio data
  format: string    // Audio format (e.g., "mp3")
  contentType: string // MIME type (e.g., "audio/mpeg")
  duration?: number // Duration in seconds (if available)
}

Playing Audio in the Browser

typescript
// Convert base64 to audio and play
function playAudio(result: TTSResult) {
  const audioData = atob(result.audio)
  const bytes = new Uint8Array(audioData.length)
  for (let i = 0; i < audioData.length; i++) {
    bytes[i] = audioData.charCodeAt(i)
  }
  
  const blob = new Blob([bytes], { type: result.contentType })
  const url = URL.createObjectURL(blob)
  
  const audio = new Audio(url)
  audio.play()
  
  // Clean up when done
  audio.onended = () => URL.revokeObjectURL(url)
}

Saving Audio to File (Node.js)

typescript
import { writeFile } from 'fs/promises'

async function saveAudio(result: TTSResult, filename: string) {
  const audioBuffer = Buffer.from(result.audio, 'base64')
  await writeFile(filename, audioBuffer)
  console.log(`Saved to ${filename}`)
}

// Usage
const result = await generateSpeech({
  adapter: openaiTTS('tts-1'),
  text: 'Hello world!',
})

await saveAudio(result, 'output.mp3')

Full-Stack Usage

TanStack AI provides React hooks and server-side streaming helpers to build full-stack text-to-speech with minimal boilerplate.

Streaming Mode (Server Route + Client Hook)

Server — Create an API route that wraps generateSpeech as a streaming response:

typescript
// routes/api/generate/speech.ts
import { generateSpeech, toServerSentEventsResponse } from '@tanstack/ai'
import { openaiTTS } from '@tanstack/ai-openai'
import { createFileRoute } from '@tanstack/react-router'

export const Route = createFileRoute('/api/generate/speech')({
  server: {
    handlers: {
      POST: async ({ request }) => {
        const body = await request.json()
        const { text, voice, format, model } = body.data

        const stream = generateSpeech({
          adapter: openaiTTS(model ?? 'tts-1'),
          text,
          voice,
          format,
          stream: true,
        })

        return toServerSentEventsResponse(stream)
      },
    },
  },
})

Client — Use the useGenerateSpeech hook with a connection adapter:

tsx
import { useGenerateSpeech, fetchServerSentEvents } from '@tanstack/ai-react'

function SpeechGenerator() {
  const { generate, result, isLoading, error } = useGenerateSpeech({
    connection: fetchServerSentEvents('/api/generate/speech'),
  })

  const playAudio = () => {
    if (!result) return
    const audioData = atob(result.audio)
    const bytes = new Uint8Array(audioData.length)
    for (let i = 0; i < audioData.length; i++) {
      bytes[i] = audioData.charCodeAt(i)
    }
    const blob = new Blob([bytes], { type: result.contentType })
    const url = URL.createObjectURL(blob)
    const audio = new Audio(url)
    audio.play()
    audio.onended = () => URL.revokeObjectURL(url)
  }

  return (
    <div>
      <button
        onClick={() => generate({ text: 'Hello, welcome to TanStack AI!' })}
        disabled={isLoading}
      >
        {isLoading ? 'Generating...' : 'Generate Speech'}
      </button>
      {error && <p>Error: {error.message}</p>}
      {result && <button onClick={playAudio}>Play Audio</button>}
    </div>
  )
}

Direct Mode (Server Function + Fetcher)

For non-streaming usage with TanStack Start server functions:

typescript
// lib/server-functions.ts
import { createServerFn } from '@tanstack/react-start'
import { generateSpeech } from '@tanstack/ai'
import { openaiTTS } from '@tanstack/ai-openai'

export const generateSpeechFn = createServerFn({ method: 'POST' })
  .inputValidator((data: { text: string; voice?: string }) => data)
  .handler(async ({ data }) => {
    return generateSpeech({
      adapter: openaiTTS('tts-1'),
      text: data.text,
      voice: data.voice,
    })
  })
tsx
import { useGenerateSpeech } from '@tanstack/ai-react'
import { generateSpeechFn } from '../lib/server-functions'

function SpeechGenerator() {
  const { generate, result, isLoading } = useGenerateSpeech({
    fetcher: (input) => generateSpeechFn({ data: input }),
  })
  // ... same UI as above
}

Server Function Streaming (Fetcher + Response)

For TanStack Start server functions that stream results. The fetcher receives type-safe input and returns an SSE Response — the client parses it automatically:

typescript
// lib/server-functions.ts
import { createServerFn } from '@tanstack/react-start'
import { generateSpeech, toServerSentEventsResponse } from '@tanstack/ai'
import { openaiTTS } from '@tanstack/ai-openai'

export const generateSpeechStreamFn = createServerFn({ method: 'POST' })
  .inputValidator((data: { text: string; voice?: string }) => data)
  .handler(({ data }) => {
    return toServerSentEventsResponse(
      generateSpeech({
        adapter: openaiTTS('tts-1'),
        text: data.text,
        voice: data.voice,
        stream: true,
      }),
    )
  })
tsx
import { useGenerateSpeech } from '@tanstack/ai-react'
import { generateSpeechStreamFn } from '../lib/server-functions'

function SpeechGenerator() {
  const { generate, result, isLoading } = useGenerateSpeech({
    fetcher: (input) => generateSpeechStreamFn({ data: input }),
  })
  // ... same UI as above
}

Hook API

The useGenerateSpeech hook accepts:

OptionTypeDescription
connectionConnectionAdapterStreaming transport (SSE, HTTP stream, custom)
fetcher(input) => Promise<TTSResult | Response>Direct async function, or server function returning an SSE Response
onResult(result) => TOutput | null | voidCallback when audio is generated. Optionally return a transformed value (see Result Transform)
onError(error) => voidCallback on error
onProgress(progress, message?) => voidProgress updates (0-100)

And returns:

PropertyTypeDescription
generate(input: SpeechGenerateInput) => Promise<void>Trigger generation
resultTOutput | nullThe result (or transformed result), or null
isLoadingbooleanWhether generation is in progress
errorError | undefinedCurrent error, if any
statusGenerationClientState'idle' | 'generating' | 'success' | 'error'
stop() => voidAbort the current generation
reset() => voidClear result, error, and return to idle

Result Transform

The onResult callback can optionally return a transformed value that replaces the stored result. This is useful for converting raw API responses into a more convenient format for your components.

Transform behavior:

  • Return a non-null value to replace the stored result with the transformed value
  • Return null to keep the previous result unchanged (useful for filtering)
  • Return nothing (void) to store the raw result as-is (backward compatible)

Example: Convert base64 audio to a playable Audio element

tsx
import { useGenerateSpeech, fetchServerSentEvents } from '@tanstack/ai-react'

function SpeechPlayer() {
  const { generate, result, isLoading } = useGenerateSpeech({
    connection: fetchServerSentEvents('/api/generate/speech'),
    onResult: (raw) => {
      const audioData = atob(raw.audio)
      const bytes = new Uint8Array(audioData.length)
      for (let i = 0; i < audioData.length; i++) {
        bytes[i] = audioData.charCodeAt(i)
      }
      const blob = new Blob([bytes], { type: raw.contentType })
      const url = URL.createObjectURL(blob)
      return {
        audio: new Audio(url),
        duration: raw.duration,
      }
    },
  })

  return (
    <div>
      <button
        onClick={() => generate({ text: 'Hello world!', voice: 'alloy' })}
        disabled={isLoading}
      >
        Generate
      </button>
      {result && (
        <button onClick={() => result.audio.play()}>
          Play Audio
        </button>
      )}
    </div>
  )
}

TypeScript automatically infers the result type from your onResult return value — no explicit generic parameter needed. In this example, result is inferred as { audio: HTMLAudioElement; duration?: number } | null, so result.audio.play() is fully type-safe.

Model Availability

OpenAI Models

ModelQualitySpeedUse Case
tts-1StandardFastReal-time applications
tts-1-hdHighSlowerProduction audio
gpt-4o-audio-previewHighestVariableAdvanced voice control
gpt-4o-mini-audio-previewHighFastBalanced quality/speed

Gemini Models

ModelStatusNotes
gemini-2.5-flash-preview-ttsExperimentalMay require Live API for full features

Error Handling

typescript
try {
  const result = await generateSpeech({
    adapter: openaiTTS('tts-1'),
    text: 'Hello!',
  })
} catch (error) {
  if (error.message.includes('exceeds maximum length')) {
    console.error('Text is too long (max 4096 characters)')
  } else if (error.message.includes('Speed must be between')) {
    console.error('Invalid speed value')
  } else {
    console.error('TTS error:', error.message)
  }
}

Environment Variables

The TTS adapters use the same environment variables as other adapters:

  • OpenAI: OPENAI_API_KEY
  • Gemini: GOOGLE_API_KEY or GEMINI_API_KEY

Explicit API Keys

For production use or when you need explicit control:

typescript
import { createOpenaiTTS } from '@tanstack/ai-openai'
import { createGeminiTTS } from '@tanstack/ai-gemini'

// OpenAI
const openaiAdapter = createOpenaiTTS('your-openai-api-key')

// Gemini
const geminiAdapter = createGeminiTTS('your-google-api-key')

Best Practices

  1. Text Length: OpenAI TTS supports up to 4096 characters per request. For longer content, split into chunks.

  2. Voice Selection: Choose voices appropriate for your content—use onyx for authoritative content, nova for friendly interactions.

  3. Format Selection: Use mp3 for general use, opus for streaming, wav for further processing.

  4. Caching: Cache generated audio to avoid regenerating the same content.

  5. Error Handling: Always handle errors gracefully, especially for user-facing applications.