How Livestream Recaps Work: The Complete Guide to AI Stream Summaries
What Is a Livestream Recap?
A livestream recap is an AI-generated summary of everything that has been said during a live broadcast. When a viewer joins your stream late or zones out for a while, they can type a command like !recapin chat and receive a concise summary of all the key topics, decisions, and events that occurred since the stream started.
Unlike a VOD (video on demand) that requires watching the entire recording, a recap gives viewers the essential information in 30-60 seconds. It's like having a co-host whose only job is keeping everyone up to speed.
The Technical Pipeline
An AI livestream recap system involves four main stages:
1. Audio Ingestion
The first step is capturing the audio from your livestream. This is done from a remote server (not your computer) by pulling the public stream feed. Tools like yt-dlp extract the audio stream from YouTube, Twitch, Kick, or Rumble. The audio is then piped through FFmpeg to convert it into a format suitable for speech recognition — typically 16kHz mono PCM.
2. Speech-to-Text Transcription
The audio is fed into a speech recognition engine in real-time. Two common approaches:
- Whisper AI (OpenAI): An open-source model that supports 99 languages. Runs locally on the server, so there are no API costs. The "small" model (~244M parameters) provides a good balance of accuracy and speed for real-time use.
- Deepgram Nova-3: A cloud-based STT API with very low latency. Supports real-time streaming transcription and automatically detects language. More accurate than Whisper for noisy audio but has API costs.
The transcription engine processes audio in small chunks (typically 1-3 seconds), converting speech to text and appending it to a growing transcript buffer.
3. Summarization with LLMs
When a viewer requests a recap, the accumulated transcript is sent to a large language model (LLM) for summarization. The prompt typically instructs the model to:
- Identify the key topics discussed
- Summarize important decisions or announcements
- Capture the overall tone and energy of the stream
- Keep the summary concise (typically 200-500 words)
Common LLM providers include Google Gemini Flash (fast and cost-effective), OpenRouter (routes to the best available model), and OpenAI GPT-4o-mini (balanced speed and quality).
4. Chat Posting
The generated summary is posted in the live chat. The method depends on the platform:
- YouTube: Via the YouTube Data API using a bot account or OAuth with posting permissions
- Twitch: Via IRC protocol using a bot account
- Kick/Rumble: Via API endpoints or through Nightbot/StreamElements integration
Why Real-Time Matters
The key difference between a livestream recap and a VOD summary is timing. A recap needs to be generated while the stream is still live, so the system must process audio in real-time with minimal latency. This requires:
- Low-latency audio streaming (yt-dlp with live flag)
- Incremental transcription (processing audio as it arrives, not after the stream ends)
- Fast LLM inference (models like Gemini Flash can summarize in under 2 seconds)
- Efficient chat posting (API rate limits must be managed)
Post-Stream Chapters
After the stream ends, the full transcript can be analyzed to generate timestamped chapters. This is a different process from live recaps because the entire transcript is available at once. An LLM analyzes the full transcript and identifies natural break points — topic changes, segment transitions, and key moments — then generates descriptive titles with timestamps (e.g., "05:23 Boss Strategy Discussion"). These chapters are exported in YouTube description format, SRT, Markdown, or JSON.
Platform Differences
Each streaming platform has different technical requirements:
- YouTube: Uses the YouTube Data API v3 for live chat and stream detection. Audio via yt-dlp from the live URL.
- Twitch: Uses IRC for chat (read/write). Audio via yt-dlp from the Twitch stream URL.
- Kick: Uses Pusher WebSocket for real-time chat. Audio via yt-dlp.
- Rumble: Uses HTTP polling for chat (no WebSocket API). Audio via yt-dlp from the stream URL.
Getting Started with CatchUp.help
CatchUp.help implements this entire pipeline for YouTube, Twitch, Kick, and Rumble. It auto-detects when you go live, captures audio in real-time, transcribes with Whisper or Deepgram, and posts AI summaries when viewers type !recap. After your stream, it generates timestamped chapters automatically. Sign up free to get started with 10 hours/month.