
Voice SDK

Learn how to use the Voice SDK.

Overview

The Voice SDK is a Python library that provides additional features optimized for conversational AI, built on top of our Realtime API.

We use it to build our integrations, and it is also available for you to use.

  • Intelligent segmentation: groups words into meaningful speech segments per speaker.
  • Turn detection: automatically detects when speakers finish talking.
  • Speaker management: focus on or ignore specific speakers in multi-speaker scenarios.
  • Preset configurations: offers ready-to-use settings for conversations, note-taking, and captions.
  • Simplified event handling: delivers clean, structured segments instead of raw word-level events.

Segmentation

Segmentation groups words into readable chunks of text. In practice, this means you can work with finalized segments rather than stitching together word-by-word updates.

Turn detection and finalization

Turn detection determines when a speaker has finished a turn. When a turn is detected, speech is finalized into segments that you can use in your application.

Turn detection (and subsequent finalization) is important for speed: the sooner a turn is finalized, the sooner you can send a final transcript to an LLM.

We take the complexity out of this through presets. If you prefer manual control, use the external preset and call client.finalize() to end a turn. This sends a signal to the Speechmatics servers to finalize the current speech immediately.
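With the external preset, ending a turn yourself looks roughly like the sketch below. The push-to-talk trigger is a hypothetical example; substitute whatever end-of-turn signal your application has.

```python
import asyncio
import os

from speechmatics.voice import VoiceAgentClient

async def transcribe_with_manual_turns():
    # The "external" preset disables automatic turn detection
    client = VoiceAgentClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        preset="external",
    )
    await client.connect()
    try:
        # ... stream audio with client.send_audio(chunk) ...
        # When your own logic decides the turn is over (for example,
        # a push-to-talk button is released), finalize the turn:
        client.finalize()
    finally:
        await client.disconnect()
```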

Diarization and speaker management

When diarization is enabled, the Voice SDK assigns speaker IDs (for example S1, S2) and produces segments per speaker.

You can also:

  • Focus on specific speakers
  • Ignore specific speakers
  • Provide known speakers for speaker identification

Voice SDK vs Realtime SDK

  • Use the Voice SDK when:
    • Building conversational AI or voice agents
    • You need automatic turn detection
    • You want speaker-focused transcription
    • You need ready-to-use presets for common scenarios
  • Use the Realtime SDK when:
    • You need the raw stream of word-by-word transcription data
    • Building custom segmentation logic
    • You want fine-grained control over every event
    • Processing audio files or custom workflows

Get started

Create an API key

Create a Speechmatics API key in the portal to access the Voice SDK. Store your key securely as a managed secret.

Install

# Standard installation
pip install speechmatics-voice

# With SMART_TURN (ML-based turn detection)
pip install speechmatics-voice[smart]

Quickstart

Here's how to stream microphone audio to the Voice Agent and print finalized segments of speech with speaker IDs:

import asyncio
import os

from speechmatics.rt import Microphone
from speechmatics.voice import VoiceAgentClient, AgentServerMessageType


async def main():
    """Stream microphone audio to the Speechmatics Voice Agent using the 'scribe' preset."""

    # Audio configuration
    SAMPLE_RATE = 16000  # Hz
    CHUNK_SIZE = 160     # Samples per read
    PRESET = "scribe"    # Configuration preset

    # Create client with preset
    client = VoiceAgentClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        preset=PRESET,
    )

    # Print finalized segments of speech with speaker ID
    @client.on(AgentServerMessageType.ADD_SEGMENT)
    def on_segment(message):
        for segment in message["segments"]:
            speaker = segment["speaker_id"]
            text = segment["text"]
            print(f"{speaker}: {text}")

    # Set up microphone
    mic = Microphone(SAMPLE_RATE, CHUNK_SIZE)
    if not mic.start():
        print("Error: Microphone not available")
        return

    # Connect to the Voice Agent
    await client.connect()

    # Stream microphone audio (interrupt with Ctrl+C)
    try:
        while True:
            audio_chunk = await mic.read(CHUNK_SIZE)
            if not audio_chunk:
                break  # Microphone stopped producing data
            await client.send_audio(audio_chunk)
    except KeyboardInterrupt:
        pass
    finally:
        await client.disconnect()


if __name__ == "__main__":
    asyncio.run(main())

Note: Microphone is imported from the Realtime SDK (speechmatics.rt). Install with pip install speechmatics.

Events and segments

The Voice SDK emits events as transcription progresses. The two main segment events are:

  • ADD_PARTIAL_SEGMENT - Interim results that stream in real-time as speech is recognized
  • ADD_SEGMENT - Final results emitted when a turn ends

How segments work

As someone speaks, you receive ADD_PARTIAL_SEGMENT events with the current transcription. These update continuously—each new partial replaces the previous one.

When a turn is detected (or you call client.finalize()), the SDK emits an ADD_SEGMENT event with the finalized transcript. This is the stable result you should use for downstream processing like sending to an LLM.

Speaking: "Hello, how are you?"

Timeline:
ADD_PARTIAL_SEGMENT: "Hello"
ADD_PARTIAL_SEGMENT: "Hello, how"
ADD_PARTIAL_SEGMENT: "Hello, how are"
ADD_PARTIAL_SEGMENT: "Hello, how are you"
(turn detected or finalize() called)
ADD_SEGMENT: "Hello, how are you?" ← Use this
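The pattern above can be captured in a small routing helper that keeps only the latest partial per speaker and collects finals for downstream use. This sketch operates on plain segment dictionaries of the shape shown in the next section and assumes nothing beyond them:

```python
def route_segments(message, msg_type, state):
    """Update `state` from a segment event.

    state = {"partials": {speaker_id: text}, "finals": [(speaker_id, text)]}
    msg_type is "AddPartialSegment" or "AddSegment".
    """
    for segment in message["segments"]:
        speaker = segment["speaker_id"]
        text = segment["text"]
        if msg_type == "AddPartialSegment":
            # Each new partial replaces the previous one for that speaker
            state["partials"][speaker] = text
        else:
            # A final segment supersedes any pending partial
            state["partials"].pop(speaker, None)
            state["finals"].append((speaker, text))
    return state

state = {"partials": {}, "finals": []}
route_segments({"segments": [{"speaker_id": "S1", "text": "Hello, how"}]},
               "AddPartialSegment", state)
route_segments({"segments": [{"speaker_id": "S1", "text": "Hello, how are you?"}]},
               "AddSegment", state)
print(state["finals"])  # [('S1', 'Hello, how are you?')]
```

Only the entries in `finals` are stable; `partials` is display-only state that you can overwrite on every event.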

Segment payload

Example ADD_SEGMENT payload:

{
  "message": "AddSegment",
  "segments": [
    {
      "speaker_id": "S1",
      "is_active": true,
      "timestamp": "2025-11-11T23:18:37.189+00:00",
      "language": "en",
      "text": "Welcome to Speechmatics.",
      "metadata": {
        "start_time": 1.28,
        "end_time": 8.04
      }
    }
  ],
  "metadata": {
    "start_time": 1.28,
    "end_time": 8.04,
    "processing_time": 0.187
  }
}

Field explanations:

  • speaker_id: Speaker label (e.g., S1, S2, or custom label if using known speakers)
  • is_active: Whether this speaker is in your focus list (see Speaker focus)
  • timestamp: Absolute wall-clock time (ISO 8601 format)
  • start_time / end_time: Time in seconds relative to the start of the session
  • processing_time: Transcription latency in seconds

Subscribing to events

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_final_segment(message):
    for segment in message["segments"]:
        print(f"[FINAL] {segment['speaker_id']}: {segment['text']}")

@client.on(AgentServerMessageType.ADD_PARTIAL_SEGMENT)
def on_partial_segment(message):
    for segment in message["segments"]:
        print(f"[PARTIAL] {segment['speaker_id']}: {segment['text']}")

When finals are emitted

Final segments (ADD_SEGMENT) are emitted when:

  1. Turn detection triggers automatically (based on your preset/config)
  2. You call client.finalize() manually (when using external preset)

See Turn detection for more on automatic finalization.

Presets

These are purpose-built, optimized configurations, ready for use without further modification:

  • FAST - low latency, fast responses
  • FIXED - general conversation with fixed timing
  • ADAPTIVE - general conversation with adaptive timing
  • SMART_TURN - complex conversation with ML-enhanced turn detection
  • EXTERNAL - user handles end of turn
  • SCRIBE - note-taking
  • CAPTIONS - live captioning

To view all available presets:

presets = VoiceAgentConfigPreset.list_presets()

Presets include defaults for all settings (language defaults to English). To change the language (or any other preset setting), use a custom configuration or use a preset as a starting point and customize with overlays.

Custom configuration

For more control, you can also specify custom configurations or use presets as a starting point and customise with overlays:

Specify configurations in a VoiceAgentConfig object:

import os
from speechmatics.voice import VoiceAgentClient, VoiceAgentConfig, EndOfUtteranceMode

config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    max_delay=0.7,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
)

client = VoiceAgentClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), config=config)

Note: If no configuration or preset is provided, the client will default to the external preset.

Basic configuration

Language and locale

language (str, default: "en")
Language code for transcription (e.g., "en", "es", "fr").
See supported languages.

output_locale (str, default: None)
Output locale for formatting (e.g., "en-GB", "en-US"). See supported languages and locales.

Model selection

operating_point (OperatingPoint, default: ENHANCED)
Select an accuracy level. Options: STANDARD or ENHANCED.

domain (str, default: None)
Domain-specific model (e.g., "finance", "medical"). See supported languages and domains.

Vocabulary

additional_vocab (list[AdditionalVocabEntry], default: [])

Custom vocabulary for domain-specific terms.

from speechmatics.voice import AdditionalVocabEntry, VoiceAgentConfig

config = VoiceAgentConfig(
    language="en",
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"]
        ),
        AdditionalVocabEntry(content="API"),
    ]
)

punctuation_overrides (dict, default: None)
Custom punctuation rules. Keys are punctuation marks, values are replacement strings.

Audio

sample_rate (int, default: 16000)
Audio sample rate in Hz.

audio_encoding (AudioEncoding, default: PCM_S16LE)
Audio encoding format.

Latency and quality

max_delay (float, default: 1.0)
Maximum transcription delay in seconds for word emission. Turn detection ensures finalization latency is not affected.

Basic diarization

enable_diarization (bool, default: False)
Enable speaker diarization to identify and label different speakers. When enabled, segments include a speaker_id field (for example S1, S2).

Basic configuration example

from speechmatics.voice import (
    AdditionalVocabEntry,
    AudioEncoding,
    OperatingPoint,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

overrides = VoiceAgentConfig(
    # Language and locale
    language="en",       # e.g. "en", "es", "fr"
    output_locale=None,  # e.g. "en-GB", "en-US"

    # Model selection
    operating_point=OperatingPoint.ENHANCED,  # STANDARD or ENHANCED
    domain=None,                              # e.g. "finance", "medical"

    # Vocabulary
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"],
        ),
        AdditionalVocabEntry(content="API"),
    ],
    punctuation_overrides=None,

    # Audio
    sample_rate=16000,
    audio_encoding=AudioEncoding.PCM_S16LE,

    # Diarization
    enable_diarization=True,
)

config = VoiceAgentConfigPreset.ADAPTIVE(overrides)

Advanced configuration

Turn detection

Presets configure turn detection under the hood. When a turn is detected (or you call client.finalize() using the external preset), we send a signal to our servers so you can get the final transcript back as quickly as possible.

This works in multi-speaker scenarios, including when diarization is enabled.

end_of_utterance_mode (EndOfUtteranceMode, default: FIXED)
Controls the base strategy for detecting turn endings:

  • FIXED: Uses a fixed silence threshold. Fast but may split slow speech.
  • ADAPTIVE: Adjusts delay based on speech rate, pauses, and disfluencies. Best for natural conversation.
  • EXTERNAL: Manual control via client.finalize(). For custom turn logic.

end_of_utterance_silence_trigger (float, default: 0.5)
Silence duration in seconds to trigger turn end.

end_of_utterance_max_delay (float, default: 10.0)
Maximum delay before forcing turn end.
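For illustration, a fixed-threshold setup with a shorter silence trigger might look like this (the values are illustrative, not recommendations):

```python
from speechmatics.voice import EndOfUtteranceMode, VoiceAgentConfig

config = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.FIXED,
    end_of_utterance_silence_trigger=0.3,  # end the turn after 0.3 s of silence
    end_of_utterance_max_delay=5.0,        # force the turn to end after 5 s
)
```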

Voice activity detection

vad_config (VoiceActivityConfig, default: None)
Configure voice activity detection:

  • enabled (bool, default: False) - Enable VAD.
  • silence_duration (float, default: 0.18) - Seconds of silence before considering speech ended.
  • threshold (float, default: 0.35) - Sensitivity threshold for detecting speech.
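As a sketch, assuming VoiceActivityConfig accepts these fields as keyword arguments:

```python
from speechmatics.voice import VoiceActivityConfig, VoiceAgentConfig

config = VoiceAgentConfig(
    vad_config=VoiceActivityConfig(
        enabled=True,
        silence_duration=0.25,  # illustrative; the default is 0.18
        threshold=0.5,          # illustrative; the default is 0.35
    ),
)
```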

Smart turn (ML-enhanced detection)

smart_turn_config (SmartTurnConfig, default: None)
Enables an ML model that detects acoustic turn-taking cues (intonation, rhythm patterns) on top of the base mode.

Smart turn can be combined with FIXED or ADAPTIVE modes, but not with EXTERNAL mode.

from speechmatics.voice import (
    EndOfUtteranceMode,
    SmartTurnConfig,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

# ADAPTIVE mode + ML-enhanced turn detection
config = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    smart_turn_config=SmartTurnConfig(enabled=True),
)

# Or use the SMART_TURN preset, which bundles this configuration
config = VoiceAgentConfigPreset.SMART_TURN()

Requires the [smart] extras: pip install speechmatics-voice[smart]

Segment output options

include_partials (bool, default: True)
Emit partial segments via ADD_PARTIAL_SEGMENT. Set to False for final-only output.

include_results (bool, default: False)
Include word-level timing data in segments.

transcription_update_preset (TranscriptionUpdatePreset, default: COMPLETE)
Controls when partial segment updates are emitted. Options: COMPLETE, COMPLETE_PLUS_TIMING, WORDS, WORDS_PLUS_TIMING, TIMING.
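For example, a final-only configuration that also attaches word-level timing (a sketch, assuming these options are plain keyword arguments on VoiceAgentConfig):

```python
from speechmatics.voice import VoiceAgentConfig

config = VoiceAgentConfig(
    include_partials=False,  # suppress ADD_PARTIAL_SEGMENT events
    include_results=True,    # include word-level timing data in segments
)
```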

Segment generation options

speech_segment_config (SpeechSegmentConfig, default: SpeechSegmentConfig())
Fine-tune segment generation and post-processing:

  • add_trailing_eos (bool, default: False) - Append end-of-sentence markers to segments that are missing them.
  • emit_sentences (bool, default: True) - Emit a finalized segment as soon as a sentence ends. If a speaker continues during a turn, multiple segments may be emitted.
  • pause_mark (Optional[str], default: None) - Insert a custom string when pauses are detected (e.g., "..." produces "Hello ... how are you?").
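Combining the options above (a sketch, assuming SpeechSegmentConfig takes these fields as keyword arguments):

```python
from speechmatics.voice import SpeechSegmentConfig, VoiceAgentConfig

config = VoiceAgentConfig(
    speech_segment_config=SpeechSegmentConfig(
        add_trailing_eos=True,  # append missing end-of-sentence markers
        emit_sentences=True,    # finalize a segment as each sentence ends
        pause_mark="...",       # e.g. "Hello ... how are you?"
    ),
)
```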

Advanced diarization

Sensitivity and speaker limits

enable_diarization (bool, default: False)
Enable speaker diarization to identify and label different speakers. You must set this to True to use any of the diarization options below.

speaker_sensitivity (float, default: 0.5)
Diarization sensitivity between 0.0 and 1.0. Higher values detect more speakers.

max_speakers (int, default: None)
Limit maximum number of speakers to detect.
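For example (the values are illustrative):

```python
from speechmatics.voice import VoiceAgentConfig

config = VoiceAgentConfig(
    enable_diarization=True,  # required for the options below
    speaker_sensitivity=0.7,  # detect additional speakers more readily
    max_speakers=4,           # cap the number of distinct speakers
)
```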

Speaker grouping

prefer_current_speaker (bool, default: False)
Give extra weight to current speaker for word grouping.

Speaker focus

speaker_config (SpeakerFocusConfig, default: SpeakerFocusConfig())
Configure speaker focus/ignore rules.

When diarization is enabled, you can control which speakers appear in your output and how they are treated.

When no focus_speakers are configured, all detected speakers are treated as active (is_active: true).

Active speakers are speakers in your focus_speakers list. Their segments have is_active: true.

Passive speakers are speakers not in focus_speakers but still included in output when using SpeakerFocusMode.RETAIN. Their segments have is_active: false.

Ignored speakers are completely excluded from output. Their speech does not appear in segments and does not trigger turn detection.

SpeakerFocusMode options:

  • RETAIN: Non-focused speakers are kept in output as passive speakers (is_active: false). Use this when you want to prioritize certain speakers but still see what others say.
  • IGNORE: Non-focused speakers are excluded entirely from output. Use this when you only care about specific speakers and want to filter out everyone else.

from speechmatics.voice import SpeakerFocusConfig, SpeakerFocusMode, VoiceAgentConfig

# Focus on specific speakers, keep others as passive
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.RETAIN
    )
)

# Focus on specific speakers, exclude everyone else
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.IGNORE
    )
)

# Exclude specific speakers from all processing
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        ignore_speakers=["S3"],
    )
)

In your event handler, you can use is_active to decide how to route segments:

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    for segment in message["segments"]:
        if segment["is_active"]:
            process_focused_speaker(segment["text"])
        else:
            process_passive_speaker(segment["speaker_id"], segment["text"])

Known speakers (speaker identification)

known_speakers (list[SpeakerIdentifier], default: [])
Pre-enrolled speaker identifiers for speaker identification.

from speechmatics.voice import SpeakerIdentifier, VoiceAgentConfig

config = VoiceAgentConfig(
    enable_diarization=True,
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
        SpeakerIdentifier(label="Bob", speaker_identifiers=["YY...YY"])
    ]
)

Advanced configuration example

from speechmatics.voice import (
    EndOfUtteranceMode,
    SpeakerFocusConfig,
    SpeakerFocusMode,
    SpeakerIdentifier,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

overrides = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
    ],
)

config = VoiceAgentConfigPreset.ADAPTIVE(overrides)

Import and export configurations

Export and import configurations as JSON:

from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig

# Export preset to JSON
config_json = VoiceAgentConfigPreset.SCRIBE().to_json()

# Load from JSON
config = VoiceAgentConfig.from_json(config_json)

# Or create from JSON string
config = VoiceAgentConfig.from_json('{"language": "en", "enable_diarization": true}')

More information