Adding Captions and Subtitles in a Video Workflow

Updated 2026-06-19 · 9 min read

Key takeaway

Captions and subtitles are not an afterthought — they are a distribution requirement for social media platforms that autoplay without sound, and an accessibility mandate for most professional content. This article shows how to embed a caption generation and subtitle burn-in stage directly into a Floniks video production workflow so that every video exits the pipeline with synchronized text overlays already applied. You'll learn which nodes handle transcription, how to pass timing data between nodes, and how to configure burn-in parameters for different platform aspect ratios and brand styles without creating separate manual steps.

Why Captions Belong in the Generation Pipeline

Adding captions manually after video generation breaks the workflow and becomes a bottleneck at scale. When you produce ten or twenty videos in a single batch (ad variants, localized versions, social cuts), captioning each one by hand multiplies the work by the size of the batch. Embedding a caption node in the Floniks workflow means every video that exits the pipeline is already captioned, timestamped, and formatted for its destination platform. The distribution team receives complete assets rather than raw video that still needs post-production. This matters most on social media platforms, where videos autoplay muted and many viewers read captions rather than listen: there, captioned video tends to see higher view completion than uncaptioned video.

Transcription Node: Extracting Spoken Words and Timing

The caption workflow begins with a transcription node placed immediately after the video generation node. The transcription node accepts the generated video as input and outputs a structured time-coded transcript: a list of words or phrases, each with a start time, end time, and confidence score. Configure three settings on the node. Source language: auto-detect works for single-language content; specify it explicitly for multilingual content. Output granularity: word-level timing for karaoke-style captions, phrase-level for conventional subtitle blocks. Confidence threshold: discard words below 0.7 confidence and replace them with a pause marker rather than rendering potentially incorrect text.
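The confidence filter described above can be sketched in a few lines of Python. This is an illustration only, not the Floniks node's implementation: the `Word` record, the `filter_low_confidence` function, and the use of `None` as the pause marker are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    """One transcript entry (hypothetical shape, not a Floniks type)."""
    text: Optional[str]   # None is this sketch's pause marker
    start: float          # seconds
    end: float            # seconds
    confidence: float     # 0.0 to 1.0


def filter_low_confidence(words: List[Word], threshold: float = 0.7) -> List[Word]:
    """Swap words below the threshold for pause markers, keeping their timing."""
    return [
        w if w.confidence >= threshold else replace(w, text=None)
        for w in words
    ]


transcript = [
    Word("hello", 0.0, 0.4, 0.95),
    Word("wrold", 0.4, 0.8, 0.55),  # low confidence, becomes a pause
]
filtered = filter_low_confidence(transcript)
```

Keeping the start and end times on the pause marker means downstream nodes still see a continuous timeline, so nothing has to be retimed.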

The time-coded transcript is the critical intermediate artifact. Every downstream node — subtitle formatter, burn-in renderer, translation node — consumes this transcript. Treat it as the source of truth for timing: do not allow any downstream node to retime captions from scratch, as retiming introduces sync drift.

Subtitle Formatter: Segmenting Into Readable Blocks

A raw word-level transcript is not directly renderable as subtitles. The Subtitle Formatter node takes the transcript and applies segmentation rules that break the word stream into caption blocks of appropriate length and duration. Key parameters are the maximum characters per line (typically 42 for 16:9 video, fewer for 9:16 vertical), the maximum lines per block (two for most platforms), the maximum block duration (five seconds is a common upper bound, so text does not linger on screen), and the minimum gap between blocks (at least 83 milliseconds, about two frames at 24 fps, so the viewer's eye can reset between reads).
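As a rough sketch of how these four parameters interact, the following Python greedily packs words into blocks and then enforces the gap. It assumes words arrive as `(text, start, end)` tuples in seconds; the `segment` function and `Block` class are hypothetical names, and a production formatter would also weigh punctuation and phrase boundaries.

```python
import textwrap
from dataclasses import dataclass
from typing import List, Tuple

TimedWord = Tuple[str, float, float]  # (text, start_s, end_s)


@dataclass
class Block:
    lines: List[str]
    start: float
    end: float


def _close(words: List[TimedWord], max_chars: int) -> Block:
    text = " ".join(w[0] for w in words)
    return Block(textwrap.wrap(text, width=max_chars), words[0][1], words[-1][2])


def segment(words: List[TimedWord], max_chars: int = 42, max_lines: int = 2,
            max_duration: float = 5.0, min_gap: float = 0.083) -> List[Block]:
    blocks: List[Block] = []
    current: List[TimedWord] = []
    for word in words:
        candidate = current + [word]
        text = " ".join(w[0] for w in candidate)
        too_long = len(textwrap.wrap(text, width=max_chars)) > max_lines
        too_slow = candidate[-1][2] - candidate[0][1] > max_duration
        if current and (too_long or too_slow):
            blocks.append(_close(current, max_chars))  # close block, start a new one
            current = [word]
        else:
            current = candidate
    if current:
        blocks.append(_close(current, max_chars))
    # Enforce the gap by trimming the earlier block's end; start times never move.
    for prev, nxt in zip(blocks, blocks[1:]):
        if nxt.start - prev.end < min_gap:
            prev.end = max(prev.start, nxt.start - min_gap)
    return blocks
```

Trimming end times rather than shifting start times is a deliberate choice in this sketch: each caption still appears exactly when the transcript says the speech begins.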

The formatter outputs a structured subtitle object — compatible with SRT, VTT, and ASS formats — that captures each block's text, start time, and end time. Wire this output to both the burn-in node (for rendering text into the video) and the subtitle file export node (for delivering the sidecar subtitle file to platforms that prefer a separate .srt). Producing both outputs in a single workflow pass means you never have to re-run the formatter to get the file format a specific platform requires.
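For reference, the sidecar SRT file is simple enough to serialize by hand. The sketch below assumes each block is a `(start, end, lines)` tuple in seconds; `srt_timestamp` and `to_srt` are illustrative helper names, not part of any Floniks export node.

```python
from typing import Iterable, List, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used by SRT."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(blocks: Iterable[Tuple[float, float, List[str]]]) -> str:
    """Serialize blocks as numbered SRT cues separated by blank lines."""
    cues = []
    for index, (start, end, lines) in enumerate(blocks, start=1):
        header = f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}"
        cues.append(header + "\n" + "\n".join(lines))
    return "\n\n".join(cues) + "\n"


print(to_srt([(0.0, 1.25, ["Hello"]), (1.5, 3.0, ["from the", "pipeline"])]))
```

WebVTT is close to this layout: the file starts with a `WEBVTT` header line and timestamps use a period instead of a comma before the milliseconds.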

Burn-In Renderer: Styling for Platform and Brand

The burn-in renderer composites subtitle text onto the video frames using the timing data from the formatter. Configure the renderer with font family, font size (in points or as a percentage of frame height for resolution-independence), font color, stroke or shadow settings for legibility on varied backgrounds, and vertical position (bottom 10–15 percent for 16:9; center-bottom for 9:16 vertical to avoid thumb-zone coverage on mobile). For brand-specific styling, create a named style preset in the renderer node and save it with the workflow template so every future run uses the same visual treatment without manual re-configuration.

For multi-platform distribution, add multiple parallel burn-in nodes downstream of the formatter — one for each platform format — each configured with its own aspect ratio crop and styling preset. This produces platform-ready captioned cuts simultaneously from a single workflow execution.
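To make resolution-independent styling concrete, here is a small sketch that resolves a percentage-based preset into pixel values for a given frame. The `StylePreset` class, the `resolve` function, the preset names, and the specific percentages are assumptions chosen to fall inside the ranges mentioned in this guide; they are not Floniks defaults.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StylePreset:
    """Hypothetical preset: sizes stored as percentages of frame height."""
    name: str
    font_height_pct: float  # font size as a percentage of frame height
    baseline_pct: float     # caption baseline, percentage of height from the top


def resolve(preset: StylePreset, frame_width: int, frame_height: int) -> Dict[str, int]:
    """Convert percentage settings to pixels for one output resolution."""
    return {
        "font_px": round(frame_height * preset.font_height_pct / 100),
        "baseline_y": round(frame_height * preset.baseline_pct / 100),
        "center_x": frame_width // 2,
    }


PRESETS = {
    # Baseline at 88% from the top sits inside the bottom 10-15 percent band.
    "landscape_16x9": StylePreset("landscape_16x9", 5.0, 88.0),
    # Vertical video keeps captions higher to clear platform UI.
    "vertical_9x16": StylePreset("vertical_9x16", 4.0, 75.0),
}

landscape = resolve(PRESETS["landscape_16x9"], 1920, 1080)
vertical = resolve(PRESETS["vertical_9x16"], 1080, 1920)
```

Because the preset stores percentages, the same entry can drive a 1080p and a 4K render without edits, which is what makes it safe to save with a workflow template.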

Translation and Multilingual Subtitle Tracks

If your content targets multiple language markets, insert a Translation node between the Subtitle Formatter and the Burn-In Renderer. The Translation node accepts the structured subtitle object and outputs a translated version with timing preserved. Because translation often changes text length (a 30-character English phrase may become 45 characters in German), the Translation node must also reapply the formatter's segmentation rules to the translated text, re-breaking long translations across more lines or into shorter blocks. Any such split should stay inside the original block's start and end times so the translated track remains anchored to the transcript. Configure the maximum characters per line for each target language: Japanese and Chinese carry more semantic content per character and can use tighter line limits; German and Finnish frequently run long and need relaxed limits or smaller fonts.

Wire each language translation output to its own burn-in node and output file node. The workflow then produces one captioned video per language from a single source generation, with all subtitle tracks derived from the same authoritative time-coded transcript.
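One way to re-break a translated block without retiming it is to split the original time span in proportion to text length. The sketch below is a simplification under stated assumptions: `refit_translation` and the `LINE_LIMITS` values are illustrative, and `textwrap` breaks on whitespace, so unspaced scripts such as Japanese are hard-broken at the width rather than at linguistically sensible points.

```python
import textwrap
from typing import List, Tuple

# Illustrative per-language limits (assumed values, tune per platform).
LINE_LIMITS = {"en": 42, "de": 46, "ja": 16}

Block = Tuple[float, float, List[str]]  # (start_s, end_s, lines)


def refit_translation(text: str, start: float, end: float, lang: str,
                      max_lines: int = 2) -> List[Block]:
    """Re-wrap translated text and, if it overflows, split it into consecutive
    sub-blocks that together cover exactly the original start..end span."""
    lines = textwrap.wrap(text, width=LINE_LIMITS[lang])
    if not lines:
        return []
    groups = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    total_chars = sum(len(line) for line in lines)
    blocks: List[Block] = []
    cursor = start
    for index, group in enumerate(groups):
        share = sum(len(line) for line in group) / total_chars
        is_last = index == len(groups) - 1
        # Pin the final end to the original end to avoid float drift.
        block_end = end if is_last else cursor + (end - start) * share
        blocks.append((cursor, block_end, group))
        cursor = block_end
    return blocks
```

The outer start and end never change, so every language track stays aligned with the source transcript even when a long translation is shown as two or three shorter blocks. The minimum gap between the sub-blocks is omitted here for brevity.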

Quality Checks: Sync Drift and Truncation Detection

Caption quality failures fall into two categories: sync drift, where the text appears noticeably before or after the corresponding speech, and truncation, where a subtitle block is cut off mid-word or displayed for less than the minimum readable duration. Add a Caption QA node at the end of the subtitle pipeline that flags both issues. Sync drift detection compares the transcript timestamps against a re-sampled audio envelope; blocks where text appears more than 200 milliseconds early or late are flagged. Truncation detection checks that every block is displayed for at least one second. Flagged blocks are reported in the workflow task log with their timecode, allowing quick manual correction without re-running the full pipeline.
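The two checks reduce to simple comparisons once a measured speech onset is available for each block. In this sketch the onsets are passed in as plain numbers; how they are derived from the audio envelope is outside its scope, and `qa_blocks` and the issue dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

Block = Tuple[float, float, str]  # (start_s, end_s, text)


def qa_blocks(blocks: List[Block], speech_onsets: List[float],
              max_drift: float = 0.2, min_duration: float = 1.0) -> List[Dict]:
    """Flag blocks that drift from the measured speech onset or flash too briefly."""
    issues = []
    for (start, end, _text), onset in zip(blocks, speech_onsets):
        drift = start - onset  # positive means the caption appears late
        if abs(drift) > max_drift:
            issues.append({"timecode": start, "issue": "sync_drift",
                           "detail": f"{drift * 1000:+.0f} ms"})
        if end - start < min_duration:
            issues.append({"timecode": start, "issue": "truncation",
                           "detail": f"{end - start:.2f} s on screen"})
    return issues


blocks = [(0.0, 2.0, "fine"), (2.5, 3.0, "too short"), (5.0, 7.0, "late")]
onsets = [0.05, 2.5, 4.5]
issues = qa_blocks(blocks, onsets)
```

Here the second block is flagged for truncation and the third for sync drift, each with the timecode needed to locate it in the task log.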

FAQ

Can this workflow handle captions for AI-generated speech as well as narrated voiceover?

Yes. If the audio track was generated by a text-to-speech node earlier in the workflow, audio transcription is unnecessary: the TTS node already holds the source text and its timing data, which is more accurate than timing recovered from the audio. Wire the TTS node's text and timing output directly to the Subtitle Formatter, bypassing the transcription node entirely for TTS-generated audio.

What is the best vertical position for captions in short-form 9:16 vertical video?

For 9:16 vertical video on platforms like TikTok and Reels, place captions at roughly 70 to 80 percent of the frame height from the top — that is, in the lower-center area above the platform's UI chrome but below the typical subject placement zone. Avoid the very bottom 10 percent, which is often obscured by platform overlays. Center the text horizontally and use a semi-transparent backing pill to maintain legibility against varied background content.

How do I handle on-screen text that conflicts with the caption area?

If the video already contains lower-third graphics, title cards, or branded text elements that overlap the default caption zone, use the burn-in renderer's dynamic position mode. This mode analyzes each frame for existing text regions and shifts the caption block to the nearest available clear zone — typically the upper portion of the frame for frames with lower-third overlays. Configure a maximum upward shift limit to prevent captions from drifting into the subject's face area.
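The shift-with-a-limit behavior can be illustrated as follows. This is a sketch of the idea, not the renderer's actual algorithm: it assumes text regions have already been detected and are given as vertical pixel ranges, and returning `None` when the limit is exceeded is this example's own convention for letting the caller fall back.

```python
from typing import List, Optional, Tuple

Region = Tuple[int, int]  # (top_y, bottom_y) in pixels, y grows downward


def place_caption(default_top: int, height: int, occupied: List[Region],
                  max_shift_up: int) -> Optional[int]:
    """Return the caption's top y after moving up past occupied regions,
    or None if that would exceed max_shift_up or leave the frame."""
    top = default_top
    while True:
        hits = [r for r in occupied if top < r[1] and top + height > r[0]]
        if not hits:
            return top
        # Jump to just above the highest region the caption currently overlaps.
        top = min(r[0] for r in hits) - height
        if default_top - top > max_shift_up or top < 0:
            return None


lower_third = [(850, 1000)]
shifted = place_caption(900, 100, lower_third, max_shift_up=200)
blocked = place_caption(900, 100, lower_third, max_shift_up=100)
```

With a 200-pixel limit the caption moves from 900 to 750 to clear the lower third; with a 100-pixel limit no placement is returned, and the caller decides whether to keep the default position or skip the block.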
