Voice SDK
Overview
The Voice SDK is a Python library that provides additional features optimized for conversational AI, built on top of our Realtime API.
We use it to build our integrations, and it is also available for you to use.
- Intelligent segmentation: groups words into meaningful speech segments per speaker.
- Turn detection: automatically detects when speakers finish talking.
- Speaker management: focus on or ignore specific speakers in multi-speaker scenarios.
- Preset configurations: offers ready-to-use settings for conversations, note-taking, and captions.
- Simplified event handling: delivers clean, structured segments instead of raw word-level events.
Segmentation
Segmentation groups words into readable chunks of text. In practice, this means you can work with finalized segments rather than stitching together word-by-word updates.
Turn detection and finalization
Turn detection determines when a speaker has finished a turn. When a turn is detected, speech is finalized into segments that you can use in your application.
Turn detection (and subsequent finalization) is important for speed: the sooner a turn is finalized, the sooner you can send a final transcript to an LLM.
We take the complexity out of this through presets.
If you prefer manual control, use the external preset and call client.finalize() to end a turn.
This sends a signal to the Speechmatics servers to finalize the current speech immediately.
Diarization and speaker management
When diarization is enabled, the Voice SDK assigns speaker IDs (for example S1, S2) and produces segments per speaker.
You can also:
- Focus on specific speakers
- Ignore specific speakers
- Provide known speakers for speaker identification
Voice SDK vs Realtime SDK
- Use the Voice SDK when:
- Building conversational AI or voice agents
- You need automatic turn detection
- You want speaker-focused transcription
- You need ready-to-use presets for common scenarios
- Use the Realtime SDK when:
- You need the raw stream of word-by-word transcription data
- Building custom segmentation logic
- You want fine-grained control over every event
- Processing audio files or custom workflows
Get started
Create an API key
Create a Speechmatics API key in the portal to access the Voice SDK. Store your key securely as a managed secret.
Install
# Standard installation
pip install speechmatics-voice
# With SMART_TURN (ML-based turn detection)
pip install speechmatics-voice[smart]
Quickstart
Here's how to stream microphone audio to the Voice Agent and transcribe finalized segments of speech, with speaker IDs:
import asyncio
import os

from speechmatics.rt import Microphone
from speechmatics.voice import VoiceAgentClient, AgentServerMessageType


async def main():
    """Stream microphone audio to the Speechmatics Voice Agent using the 'scribe' preset."""
    # Audio configuration
    SAMPLE_RATE = 16000  # Hz
    CHUNK_SIZE = 160     # Samples per read
    PRESET = "scribe"    # Configuration preset

    # Create client with preset
    client = VoiceAgentClient(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        preset=PRESET,
    )

    # Print finalized segments of speech with speaker ID
    @client.on(AgentServerMessageType.ADD_SEGMENT)
    def on_segment(message):
        for segment in message["segments"]:
            speaker = segment["speaker_id"]
            text = segment["text"]
            print(f"{speaker}: {text}")

    # Set up microphone
    mic = Microphone(SAMPLE_RATE, CHUNK_SIZE)
    if not mic.start():
        print("Error: Microphone not available")
        return

    # Connect to the Voice Agent
    await client.connect()

    # Stream microphone audio (interrupt with Ctrl+C)
    try:
        while True:
            audio_chunk = await mic.read(CHUNK_SIZE)
            if not audio_chunk:
                break  # Microphone stopped producing data
            await client.send_audio(audio_chunk)
    except KeyboardInterrupt:
        pass
    finally:
        await client.disconnect()


if __name__ == "__main__":
    asyncio.run(main())
Note: Microphone is imported from the Realtime SDK (speechmatics.rt). Install with pip install speechmatics.
Events and segments
The Voice SDK emits events as transcription progresses. The two main segment events are:
- ADD_PARTIAL_SEGMENT - Interim results that stream in real time as speech is recognized
- ADD_SEGMENT - Final results emitted when a turn ends
How segments work
As someone speaks, you receive ADD_PARTIAL_SEGMENT events with the current transcription. These update continuously—each new partial replaces the previous one.
When a turn is detected (or you call client.finalize()), the SDK emits an ADD_SEGMENT event with the finalized transcript. This is the stable result you should use for downstream processing like sending to an LLM.
Speaking: "Hello, how are you?"
Timeline:
ADD_PARTIAL_SEGMENT: "Hello"
ADD_PARTIAL_SEGMENT: "Hello, how"
ADD_PARTIAL_SEGMENT: "Hello, how are"
ADD_PARTIAL_SEGMENT: "Hello, how are you"
(turn detected or finalize() called)
ADD_SEGMENT: "Hello, how are you?" ← Use this
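This replace-then-commit pattern can be mirrored with a small buffer in application code. `TurnBuffer` below is a hypothetical helper (not part of the SDK) showing how each partial overwrites the previous one while finals accumulate:

```python
class TurnBuffer:
    """Tracks the latest partial transcript and commits finalized turns."""

    def __init__(self):
        self.partial = ""   # latest interim text (replaced on each update)
        self.finals = []    # committed turns, in arrival order

    def on_partial(self, text):
        # Each partial replaces the previous one wholesale.
        self.partial = text

    def on_final(self, text):
        # A final supersedes any pending partial for this turn.
        self.finals.append(text)
        self.partial = ""


buf = TurnBuffer()
for update in ["Hello", "Hello, how", "Hello, how are", "Hello, how are you"]:
    buf.on_partial(update)
buf.on_final("Hello, how are you?")
print(buf.finals)  # prints: ['Hello, how are you?']
```

Only the committed finals are sent downstream; the pending partial can drive a live UI.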
Segment payload
Example ADD_SEGMENT payload:
{
  "message": "AddSegment",
  "segments": [
    {
      "speaker_id": "S1",
      "is_active": true,
      "timestamp": "2025-11-11T23:18:37.189+00:00",
      "language": "en",
      "text": "Welcome to Speechmatics.",
      "metadata": {
        "start_time": 1.28,
        "end_time": 8.04
      }
    }
  ],
  "metadata": {
    "start_time": 1.28,
    "end_time": 8.04,
    "processing_time": 0.187
  }
}
Field explanations:
- speaker_id: Speaker label (e.g., S1, S2, or a custom label if using known speakers)
- is_active: Whether this speaker is in your focus list (see Speaker focus)
- timestamp: Absolute wall-clock time (ISO 8601 format)
- start_time / end_time: Time in seconds relative to the start of the session
- processing_time: Transcription latency in seconds
Subscribing to events
@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_final_segment(message):
    for segment in message["segments"]:
        print(f"[FINAL] {segment['speaker_id']}: {segment['text']}")

@client.on(AgentServerMessageType.ADD_PARTIAL_SEGMENT)
def on_partial_segment(message):
    for segment in message["segments"]:
        print(f"[PARTIAL] {segment['speaker_id']}: {segment['text']}")
When finals are emitted
Final segments (ADD_SEGMENT) are emitted when:
- Turn detection triggers automatically (based on your preset/config)
- You call client.finalize() manually (when using the external preset)
See Turn detection for more on automatic finalization.
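With the external preset, ending a turn is entirely your responsibility. A minimal sketch (mirroring the quickstart's client setup; the push-to-talk trigger is a hypothetical example):

```python
import os
from speechmatics.voice import VoiceAgentClient

# External preset: the SDK performs no automatic turn detection.
client = VoiceAgentClient(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    preset="external",
)

# Later, when your own logic decides the turn is over (for example,
# a push-to-talk button is released), finalize the current speech:
# client.finalize()  # emits ADD_SEGMENT with the final transcript
```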
Presets
These are purpose-built, optimized configurations, ready for use without further modification:
- FAST - low latency, fast responses
- FIXED - general conversation with fixed timing
- ADAPTIVE - general conversation with adaptive timing
- SMART_TURN - complex conversation with ML-enhanced turn detection
- EXTERNAL - user handles end of turn
- SCRIBE - note-taking
- CAPTIONS - live captioning
To view all available presets:
presets = VoiceAgentConfigPreset.list_presets()
Presets include defaults for all settings (language defaults to English). To change the language (or any other preset setting), use a custom configuration or use a preset as a starting point and customize with overlays.
Custom configuration
For more control, you can specify a fully custom configuration, or use a preset as a starting point and customize it with overlays:
Specify configurations in a VoiceAgentConfig object:
import os

from speechmatics.voice import VoiceAgentClient, VoiceAgentConfig, EndOfUtteranceMode

config = VoiceAgentConfig(
    language="en",
    enable_diarization=True,
    max_delay=0.7,
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
)

client = VoiceAgentClient(api_key=os.getenv("SPEECHMATICS_API_KEY"), config=config)
Use presets as a starting point and customize with overlays:
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig

# Use preset with custom overrides
config = VoiceAgentConfigPreset.SCRIBE(
    VoiceAgentConfig(
        language="es",
        max_delay=0.8,
    )
)
Note: If no configuration or preset is provided, the client will default to the external preset.
Basic configuration
Language and locale
language (str, default: "en")
Language code for transcription (e.g., "en", "es", "fr").
See supported languages.
output_locale (str, default: None)
Output locale for formatting (e.g., "en-GB", "en-US").
See supported languages and locales.
Model selection
operating_point (OperatingPoint, default: ENHANCED)
Select an accuracy level.
Options: STANDARD or ENHANCED.
domain (str, default: None)
Domain-specific model (e.g., "finance", "medical").
See supported languages and domains.
Vocabulary
additional_vocab (list[AdditionalVocabEntry], default: [])
Custom vocabulary for domain-specific terms.
from speechmatics.voice import AdditionalVocabEntry, VoiceAgentConfig

config = VoiceAgentConfig(
    language="en",
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"],
        ),
        AdditionalVocabEntry(content="API"),
    ],
)
punctuation_overrides (dict, default: None)
Custom punctuation rules. Keys are punctuation marks, values are replacement strings.
Audio
sample_rate (int, default: 16000)
Audio sample rate in Hz.
audio_encoding (AudioEncoding, default: PCM_S16LE)
Audio encoding format.
Latency and quality
max_delay (float, default: 1.0)
Maximum transcription delay in seconds for word emission.
Turn detection ensures finalization latency is not affected.
Basic diarization
enable_diarization (bool, default: False)
Enable speaker diarization to identify and label different speakers.
When enabled, segments include a speaker_id field (for example S1, S2).
Basic configuration example
from speechmatics.voice import (
    AdditionalVocabEntry,
    AudioEncoding,
    OperatingPoint,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

overrides = VoiceAgentConfig(
    # Language and locale
    language="en",        # e.g. "en", "es", "fr"
    output_locale=None,   # e.g. "en-GB", "en-US"
    # Model selection
    operating_point=OperatingPoint.ENHANCED,  # STANDARD or ENHANCED
    domain=None,          # e.g. "finance", "medical"
    # Vocabulary
    additional_vocab=[
        AdditionalVocabEntry(
            content="Speechmatics",
            sounds_like=["speech matters", "speech matics"],
        ),
        AdditionalVocabEntry(content="API"),
    ],
    punctuation_overrides=None,
    # Audio
    sample_rate=16000,
    audio_encoding=AudioEncoding.PCM_S16LE,
    # Diarization
    enable_diarization=True,
)

config = VoiceAgentConfigPreset.ADAPTIVE(overrides)
Advanced configuration
Turn detection
Presets configure turn detection under the hood.
When a turn is detected (or you call client.finalize() using the external preset), we send a signal to our servers so you can get the final transcript back as quickly as possible.
This works in multi-speaker scenarios, including when diarization is enabled.
end_of_utterance_mode (EndOfUtteranceMode, default: FIXED)
Controls the base strategy for detecting turn endings:
- FIXED: Uses a fixed silence threshold. Fast but may split slow speech.
- ADAPTIVE: Adjusts delay based on speech rate, pauses, and disfluencies. Best for natural conversation.
- EXTERNAL: Manual control via client.finalize(). For custom turn logic.
end_of_utterance_silence_trigger (float, default: 0.5)
Silence duration in seconds to trigger turn end.
end_of_utterance_max_delay (float, default: 10.0)
Maximum delay before forcing turn end.
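As an illustration, here is a fixed-threshold setup that finalizes after 0.8 s of silence but never holds a turn open longer than 5 s (the values are illustrative, not recommendations):

```python
from speechmatics.voice import EndOfUtteranceMode, VoiceAgentConfig

config = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.FIXED,
    end_of_utterance_silence_trigger=0.8,  # finalize after 0.8s of silence
    end_of_utterance_max_delay=5.0,        # force a turn end after 5s regardless
)
```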
Voice activity detection
vad_config (VoiceActivityConfig, default: None)
Configure voice activity detection:
- enabled (bool, default: False) - Enable VAD.
- silence_duration (float, default: 0.18) - Seconds of silence before considering speech ended.
- threshold (float, default: 0.35) - Sensitivity threshold for detecting speech.
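A sketch enabling VAD with slightly more conservative settings than the defaults (values here are illustrative and should be tuned to your audio environment):

```python
from speechmatics.voice import VoiceActivityConfig, VoiceAgentConfig

config = VoiceAgentConfig(
    vad_config=VoiceActivityConfig(
        enabled=True,
        silence_duration=0.25,  # wait slightly longer than the 0.18s default
        threshold=0.5,          # higher threshold for noisy rooms
    )
)
```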
Smart turn (ML-enhanced detection)
smart_turn_config (SmartTurnConfig, default: None)
Enables an ML model that detects acoustic turn-taking cues (intonation, rhythm patterns) on top of the base mode.
Smart turn can be combined with FIXED or ADAPTIVE modes, but not with EXTERNAL mode.
from speechmatics.voice import (
    EndOfUtteranceMode,
    SmartTurnConfig,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

# ADAPTIVE mode + ML-enhanced turn detection
config = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    smart_turn_config=SmartTurnConfig(enabled=True),
)

# Or use the SMART_TURN preset which bundles this configuration
config = VoiceAgentConfigPreset.SMART_TURN()
Requires the [smart] extras: pip install speechmatics-voice[smart]
Segment output options
include_partials (bool, default: True)
Emit partial segments via ADD_PARTIAL_SEGMENT.
Set to False for final-only output.
include_results (bool, default: False)
Include word-level timing data in segments.
transcription_update_preset (TranscriptionUpdatePreset, default: COMPLETE)
Controls when partial segment updates are emitted.
Options: COMPLETE, COMPLETE_PLUS_TIMING, WORDS, WORDS_PLUS_TIMING, TIMING.
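For example, a final-only configuration that also attaches word-level timing (a sketch using the options above):

```python
from speechmatics.voice import VoiceAgentConfig

config = VoiceAgentConfig(
    include_partials=False,  # suppress ADD_PARTIAL_SEGMENT events
    include_results=True,    # attach word-level timing data to segments
)
```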
Segment generation options
speech_segment_config (SpeechSegmentConfig, default: SpeechSegmentConfig())
Fine-tune segment generation and post-processing:
- add_trailing_eos (bool, default: False) - Append end-of-sentence markers to segments that are missing them.
- emit_sentences (bool, default: True) - Emit a finalized segment as soon as a sentence ends. If a speaker continues during a turn, multiple segments may be emitted.
- pause_mark (Optional[str], default: None) - Insert a custom string when pauses are detected (e.g., "..." produces "Hello ... how are you?").
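A sketch combining these options to emit per-sentence finals with pauses marked:

```python
from speechmatics.voice import SpeechSegmentConfig, VoiceAgentConfig

config = VoiceAgentConfig(
    speech_segment_config=SpeechSegmentConfig(
        emit_sentences=True,  # finalize each sentence as soon as it ends
        pause_mark="...",     # mark detected pauses within a segment
    )
)
```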
Advanced diarization
Sensitivity and speaker limits
enable_diarization (bool, default: False)
Enable speaker diarization to identify and label different speakers.
You must set this to True to use any of the diarization options below.
speaker_sensitivity (float, default: 0.5)
Diarization sensitivity between 0.0 and 1.0.
Higher values detect more speakers.
max_speakers (int, default: None)
Limit maximum number of speakers to detect.
Speaker grouping
prefer_current_speaker (bool, default: False)
Give extra weight to current speaker for word grouping.
Speaker focus
speaker_config (SpeakerFocusConfig, default: SpeakerFocusConfig())
Configure speaker focus/ignore rules.
When diarization is enabled, you can control which speakers appear in your output and how they are treated.
When no focus_speakers are configured, all detected speakers are treated as active (is_active: true).
Active speakers are speakers in your focus_speakers list.
Their segments have is_active: true.
Passive speakers are speakers not in focus_speakers but still included in output when using SpeakerFocusMode.RETAIN.
Their segments have is_active: false.
Ignored speakers are completely excluded from output. Their speech does not appear in segments and does not trigger turn detection.
SpeakerFocusMode options:
- RETAIN: Non-focused speakers are kept in output as passive speakers (is_active: false). Use this when you want to prioritize certain speakers but still see what others say.
- IGNORE: Non-focused speakers are excluded entirely from output. Use this when you only care about specific speakers and want to filter out everyone else.
from speechmatics.voice import SpeakerFocusConfig, SpeakerFocusMode, VoiceAgentConfig

# Focus on specific speakers, keep others as passive
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.RETAIN,
    )
)

# Focus on specific speakers, exclude everyone else
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1", "S2"],
        focus_mode=SpeakerFocusMode.IGNORE,
    )
)

# Blacklist specific speakers (exclude them from all processing)
config = VoiceAgentConfig(
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        ignore_speakers=["S3"],
    )
)
In your event handler, you can use is_active to decide how to route segments:
@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_segment(message):
    for segment in message["segments"]:
        if segment["is_active"]:
            process_focused_speaker(segment["text"])
        else:
            process_passive_speaker(segment["speaker_id"], segment["text"])
Known speakers (speaker identification)
known_speakers (list[SpeakerIdentifier], default: [])
Pre-enrolled speaker identifiers for speaker identification.
from speechmatics.voice import SpeakerIdentifier, VoiceAgentConfig

config = VoiceAgentConfig(
    enable_diarization=True,
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
        SpeakerIdentifier(label="Bob", speaker_identifiers=["YY...YY"]),
    ]
)
Advanced configuration example
from speechmatics.voice import (
    EndOfUtteranceMode,
    SpeakerFocusConfig,
    SpeakerFocusMode,
    SpeakerIdentifier,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

overrides = VoiceAgentConfig(
    end_of_utterance_mode=EndOfUtteranceMode.ADAPTIVE,
    enable_diarization=True,
    speaker_config=SpeakerFocusConfig(
        focus_speakers=["S1"],
        focus_mode=SpeakerFocusMode.RETAIN,
    ),
    known_speakers=[
        SpeakerIdentifier(label="Alice", speaker_identifiers=["XX...XX"]),
    ],
)

config = VoiceAgentConfigPreset.ADAPTIVE(overrides)
Import and export configurations
Export and import configurations as JSON:
from speechmatics.voice import VoiceAgentConfigPreset, VoiceAgentConfig
# Export preset to JSON
config_json = VoiceAgentConfigPreset.SCRIBE().to_json()
# Load from JSON
config = VoiceAgentConfig.from_json(config_json)
# Or create from JSON string
config = VoiceAgentConfig.from_json('{"language": "en", "enable_diarization": true}')
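One use of this round trip is persisting a tuned configuration to disk so another process can load identical settings (a sketch; the file path is arbitrary):

```python
from pathlib import Path

from speechmatics.voice import VoiceAgentConfig, VoiceAgentConfigPreset

# Save a tuned preset so other services can share the same settings
Path("voice_config.json").write_text(VoiceAgentConfigPreset.SCRIBE().to_json())

# ...later, or in another process
config = VoiceAgentConfig.from_json(Path("voice_config.json").read_text())
```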
More information
- Voice SDK on GitHub: https://github.com/speechmatics/speechmatics-python-sdk/tree/main/sdk/voice