What Is Speech-to-Text (STT)?

Definition

Speech-to-text (STT), also known as automatic speech recognition (ASR), is AI technology that converts spoken language from audio input into written text, enabling machines to understand and process human speech.

How Speech-to-Text (STT) Works

Speech-to-text systems analyze audio signals to identify spoken words and transcribe them into text. Modern STT systems are built on deep learning architectures, particularly transformer-based models such as OpenAI's Whisper, which have substantially improved accuracy across accents, languages, and noisy environments. The process involves converting audio waveforms into spectrograms or other feature representations, then using neural networks to map those features to text tokens.

Advanced systems also handle punctuation, capitalization, and speaker diarization (identifying who said what), and can process audio in real time. State-of-the-art models now approach human-level transcription accuracy in many scenarios.

Key applications include meeting transcription, voice assistants, accessibility features for deaf and hard-of-hearing users, medical dictation, call center analytics, and subtitle generation. STT is also a critical component of voice-controlled AI agents and multimodal systems that need to process spoken input alongside text and images.
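The feature-extraction step described above, turning a waveform into a spectrogram, can be illustrated with a minimal sketch. It is not how any particular product implements it: the function names, the frame size of 64 samples, the hop of 32 samples, the 8 kHz sample rate, and the synthetic 1 kHz test tone are all assumptions chosen to keep the example small. Production systems typically use an optimized FFT and mel-scaled filterbanks instead of the naive transform shown here.

```python
import cmath
import math


def hann_window(n):
    """Hann window of length n, used to taper each frame before the transform."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Split the signal into overlapping frames and return, for each frame,
    the magnitude of every non-negative frequency bin (a naive short-time DFT)."""
    window = hann_window(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz (stands in for real audio).
SAMPLE_RATE = 8000
TONE_HZ = 1000
FRAME_SIZE = 64
samples = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]

spec = magnitude_spectrogram(samples, frame_size=FRAME_SIZE, hop=32)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), len(spec[0]), peak_hz)  # prints: 15 33 1000.0
```

Each row of `spec` is one time step and each column one frequency band. A grid of this kind (usually rescaled to a log-mel representation) is what a neural network would consume when mapping audio features to text tokens.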

Real-World Examples

1. Otter.ai automatically transcribing a Zoom meeting with speaker labels and timestamps
2. A doctor dictating patient notes into an EHR system that converts speech to structured medical records in real time
3. YouTube generating automatic captions for uploaded videos using Google's speech recognition models
