
How Livestream Recaps Work: The Complete Guide to AI Stream Summaries

Published May 2, 2026 · 8 min read

What Is a Livestream Recap?

A livestream recap is an AI-generated summary of everything that has been said during a live broadcast. When a viewer joins your stream late or zones out for a while, they can type a command like !recap in chat and receive a concise summary of all the key topics, decisions, and events that occurred since the stream started.

Unlike a VOD (video on demand) that requires watching the entire recording, a recap gives viewers the essential information in 30-60 seconds. It's like having a co-host whose only job is keeping everyone up to speed.

The Technical Pipeline

An AI livestream recap system involves four main stages:

1. Audio Ingestion

The first step is capturing the audio from your livestream. This is done from a remote server (not your computer) by pulling the public stream feed. Tools like yt-dlp extract the audio stream from YouTube, Twitch, Kick, or Rumble. The audio is then piped through FFmpeg to convert it into a format suitable for speech recognition — typically 16kHz mono PCM.
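As a rough illustration of that handoff, here is a minimal Python sketch that builds a yt-dlp command and an FFmpeg command and connects them with a pipe. The function names and the "bestaudio/best" format selector are assumptions made for this example, not details of any particular product; a real deployment would also need reconnect and error handling.

```python
import subprocess

SAMPLE_RATE_HZ = 16_000  # 16kHz, the target rate mentioned above


def build_ytdlp_cmd(stream_url: str) -> list[str]:
    """yt-dlp command that writes the stream's media data to stdout."""
    return [
        "yt-dlp",
        "--quiet",               # keep progress output out of the way
        "-f", "bestaudio/best",  # assumption: prefer audio-only, else best available
        "-o", "-",               # "-" sends the download to stdout
        stream_url,
    ]


def build_ffmpeg_cmd(sample_rate: int = SAMPLE_RATE_HZ) -> list[str]:
    """FFmpeg command that reads media on stdin and emits raw 16-bit mono PCM."""
    return [
        "ffmpeg",
        "-loglevel", "error",
        "-i", "pipe:0",          # read from stdin
        "-vn",                   # drop any video track
        "-ac", "1",              # mono
        "-ar", str(sample_rate), # resample to the target rate
        "-f", "s16le",           # raw signed 16-bit little-endian samples
        "pipe:1",                # write to stdout
    ]


def start_audio_pipeline(stream_url: str) -> subprocess.Popen:
    """Start yt-dlp | ffmpeg and return the ffmpeg process; read PCM from .stdout."""
    ytdlp = subprocess.Popen(build_ytdlp_cmd(stream_url), stdout=subprocess.PIPE)
    ffmpeg = subprocess.Popen(
        build_ffmpeg_cmd(), stdin=ytdlp.stdout, stdout=subprocess.PIPE
    )
    ytdlp.stdout.close()  # so yt-dlp sees a closed pipe if ffmpeg exits first
    return ffmpeg
```

Calling start_audio_pipeline with a live URL requires both tools to be installed on the server. The two builder functions can be inspected on their own without starting anything.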

2. Speech-to-Text Transcription

The audio is fed into a speech recognition engine in real time. Two common approaches:

Whisper (OpenAI): An open-source model that supports 99 languages. It can run locally on the server, so there are no per-request API costs (you still pay for the server's compute). The "small" model (~244M parameters) provides a good balance of accuracy and speed for real-time use.
Deepgram Nova-3: A cloud-based STT API with very low latency. Supports real-time streaming transcription and automatic language detection. It can be more accurate than Whisper on noisy audio, but usage incurs API costs.

The transcription engine processes audio in small chunks (typically 1-3 seconds), converting speech to text and appending it to a growing transcript buffer.
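To make the chunk-and-append loop concrete, here is a small Python sketch. It assumes 16kHz, 16-bit mono PCM input and takes the speech-to-text step as a plain callable, so the transcribe argument is a hypothetical stand-in for whichever engine you use (Whisper, Deepgram, or something else). The class and function names are illustrative only.

```python
import io
from typing import BinaryIO, Callable, Iterator

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit PCM
BYTES_PER_SECOND = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE


def chunk_size_bytes(seconds: float) -> int:
    """Number of bytes holding `seconds` of 16kHz 16-bit mono audio."""
    return int(SAMPLE_RATE_HZ * seconds) * BYTES_PER_SAMPLE


def iter_chunks(pcm: BinaryIO, seconds: float = 2.0) -> Iterator[bytes]:
    """Yield fixed-duration chunks until the stream ends (last one may be shorter)."""
    size = chunk_size_bytes(seconds)
    while True:
        chunk = pcm.read(size)
        if not chunk:
            break
        yield chunk


class TranscriptBuffer:
    """Growing list of (start_seconds, text) segments."""

    def __init__(self) -> None:
        self.segments: list[tuple[float, str]] = []

    def append(self, start_seconds: float, text: str) -> None:
        text = text.strip()
        if text:  # skip silent chunks that produced no words
            self.segments.append((start_seconds, text))

    def text(self) -> str:
        return " ".join(t for _, t in self.segments)


def transcribe_stream(
    pcm: BinaryIO, transcribe: Callable[[bytes], str], seconds: float = 2.0
) -> TranscriptBuffer:
    """Run the (hypothetical) `transcribe` callable over each chunk in order."""
    buffer = TranscriptBuffer()
    offset = 0.0
    for chunk in iter_chunks(pcm, seconds):
        buffer.append(offset, transcribe(chunk))
        offset += len(chunk) / BYTES_PER_SECOND
    return buffer
```

Keeping the start offset with each segment costs almost nothing and makes the same buffer reusable later for timestamped output.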

3. Summarization with LLMs

When a viewer requests a recap, the accumulated transcript is sent to a large language model (LLM) for summarization. The prompt typically instructs the model to:

Identify the key topics discussed
Summarize important decisions or announcements
Capture the overall tone and energy of the stream
Keep the summary concise (typically 200-500 words)

Common options include Google Gemini Flash (fast and cost-effective), OpenAI GPT-4o mini (balanced speed and quality), and OpenRouter (a gateway that routes requests to models from many providers through a single API).
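The instructions listed above can be assembled into a prompt along these lines. This is only a sketch: the wording of the prompt, the function name, and the 300-word default are assumptions chosen for illustration, and the call to the LLM provider itself is deliberately left out because it differs per provider.

```python
RECAP_INSTRUCTIONS = (
    "Identify the key topics discussed",
    "Summarize important decisions or announcements",
    "Capture the overall tone and energy of the stream",
)


def build_recap_prompt(transcript: str, max_words: int = 300) -> str:
    """Combine the summarization rules and the transcript into one prompt string."""
    transcript = transcript.strip()
    if not transcript:
        raise ValueError("transcript is empty; nothing to summarize yet")
    rules = [*RECAP_INSTRUCTIONS, f"Keep the summary under {max_words} words"]
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "You are summarizing a live broadcast for a viewer who joined late.\n"
        f"{numbered}\n\n"
        "Transcript so far:\n"
        f"{transcript}"
    )
```

The returned string would then be sent to whichever model you have configured.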

4. Chat Posting

The generated summary is posted in the live chat. The method depends on the platform:

YouTube: Via the YouTube Data API using a bot account or OAuth with posting permissions
Twitch: Via IRC protocol using a bot account
Kick/Rumble: Via API endpoints or through Nightbot/StreamElements integration
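For the Twitch case, a summary usually has to be flattened to a single line and split into several chat messages before it goes out over IRC. The sketch below shows one way to do that. The 500-character limit is an assumption to verify against current Twitch documentation, and the helper names are made up for this example; connecting and authenticating the socket is not shown.

```python
import textwrap

TWITCH_MAX_CHARS = 500  # assumed per-message limit; check current platform docs


def split_for_chat(summary: str, max_chars: int = TWITCH_MAX_CHARS) -> list[str]:
    """Collapse whitespace (IRC messages are single-line) and split on word boundaries."""
    single_line = " ".join(summary.split())
    return textwrap.wrap(single_line, width=max_chars)


def to_privmsg_lines(
    channel: str, summary: str, max_chars: int = TWITCH_MAX_CHARS
) -> list[str]:
    """Format each part as an IRC PRIVMSG line, ready to be encoded and sent."""
    channel = channel.lstrip("#").lower()
    return [
        f"PRIVMSG #{channel} :{part}\r\n"
        for part in split_for_chat(summary, max_chars)
    ]
```

For example, to_privmsg_lines("#MyChannel", summary) returns one string per chat message, each terminated with the CRLF that IRC expects.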

Why Real-Time Matters

The key difference between a livestream recap and a VOD summary is timing. A recap needs to be generated while the stream is still live, so the system must process audio in real time with minimal latency. This requires:

Low-latency audio streaming (yt-dlp reading the live stream URL while the broadcast is in progress)
Incremental transcription (processing audio as it arrives, not after the stream ends)
Fast LLM inference (models like Gemini Flash can summarize in under 2 seconds)
Efficient chat posting (API rate limits must be managed)
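Managing rate limits typically means tracking how many messages the bot has sent recently and holding back when the budget is used up. Here is a minimal sliding-window limiter as one possible approach. The class name is illustrative, and any specific limit you plug in (for instance 20 messages per 30 seconds) is an assumption to confirm against the platform's current documentation.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_messages` sends within any `window_seconds` span."""

    def __init__(
        self,
        max_messages: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._clock = clock            # injectable so the limiter is easy to test
        self._sent: deque[float] = deque()

    def _prune(self, now: float) -> None:
        # Forget sends that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()

    def try_acquire(self) -> bool:
        """Record a send and return True if allowed; return False if over budget."""
        now = self._clock()
        self._prune(now)
        if len(self._sent) < self.max_messages:
            self._sent.append(now)
            return True
        return False

    def seconds_until_free(self) -> float:
        """How long to wait before the next send would be allowed."""
        now = self._clock()
        self._prune(now)
        if len(self._sent) < self.max_messages:
            return 0.0
        return self.window_seconds - (now - self._sent[0])
```

A posting loop would call try_acquire before each message and sleep for seconds_until_free when it returns False.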

Post-Stream Chapters

After the stream ends, the full transcript can be analyzed to generate timestamped chapters. This is a different process from live recaps because the entire transcript is available at once. An LLM analyzes the full transcript and identifies natural break points — topic changes, segment transitions, and key moments — then generates descriptive titles with timestamps (e.g., "05:23 Boss Strategy Discussion"). These chapters are exported in YouTube description format, SRT, Markdown, or JSON.
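Once the LLM has picked the break points, turning them into export formats is mostly string formatting. The sketch below covers the YouTube description and JSON cases. The Chapter dataclass and function names are assumptions made for this example. Note that YouTube generally expects the first chapter to start at 00:00, so check your output against its current requirements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Chapter:
    start_seconds: int
    title: str


def format_timestamp(total_seconds: int) -> str:
    """Render seconds as MM:SS, or H:MM:SS once the stream passes one hour."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes:02d}:{seconds:02d}"


def to_youtube_description(chapters: list[Chapter]) -> str:
    """One 'timestamp title' line per chapter, sorted by start time."""
    ordered = sorted(chapters, key=lambda c: c.start_seconds)
    return "\n".join(
        f"{format_timestamp(c.start_seconds)} {c.title}" for c in ordered
    )


def to_json(chapters: list[Chapter]) -> str:
    """Same chapters as a JSON array of objects."""
    return json.dumps([asdict(c) for c in chapters], indent=2)
```

With a chapter starting at 323 seconds titled "Boss Strategy Discussion", to_youtube_description produces the "05:23 Boss Strategy Discussion" line shown in the example above.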

Platform Differences

Each streaming platform has different technical requirements:

YouTube: Uses the YouTube Data API v3 for live chat and stream detection. Audio via yt-dlp from the live URL.
Twitch: Uses IRC for chat (read/write). Audio via yt-dlp from the Twitch stream URL.
Kick: Uses Pusher WebSocket for real-time chat. Audio via yt-dlp.
Rumble: Uses HTTP polling for chat (no WebSocket API). Audio via yt-dlp from the stream URL.
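One simple way to keep these differences manageable in code is a lookup table keyed by the stream URL's domain. The sketch below is an illustration only: the PlatformProfile type, the detect_platform helper, and the domain list are assumptions for this example, and the transport labels simply restate the list above.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PlatformProfile:
    name: str
    chat_transport: str
    audio_source: str


# Domains are assumptions for this sketch; transports mirror the list above.
PLATFORMS = {
    "youtube.com": PlatformProfile("YouTube", "YouTube Data API v3", "yt-dlp"),
    "youtu.be": PlatformProfile("YouTube", "YouTube Data API v3", "yt-dlp"),
    "twitch.tv": PlatformProfile("Twitch", "IRC", "yt-dlp"),
    "kick.com": PlatformProfile("Kick", "Pusher WebSocket", "yt-dlp"),
    "rumble.com": PlatformProfile("Rumble", "HTTP polling", "yt-dlp"),
}


def detect_platform(stream_url: str) -> PlatformProfile:
    """Match the URL's host (including subdomains such as www.) to a profile."""
    host = (urlparse(stream_url).hostname or "").lower()
    for domain, profile in PLATFORMS.items():
        if host == domain or host.endswith("." + domain):
            return profile
    raise ValueError(f"unsupported platform for URL: {stream_url!r}")
```

The rest of the pipeline can then branch on chat_transport instead of re-parsing URLs in several places.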

Getting Started with CatchUp.help

CatchUp.help implements this entire pipeline for YouTube, Twitch, Kick, and Rumble. It auto-detects when you go live, captures audio in real time, transcribes with Whisper or Deepgram, and posts AI summaries when viewers type !recap. After your stream, it generates timestamped chapters automatically. Sign up free to get started with 10 hours/month.