A Voiceover and Subtitle-Sync Workflow
Video content without accessible, accurately timed subtitles loses a substantial part of its potential audience: viewers in sound-off environments, non-native speakers, and people who rely on captions for accessibility. Producing and syncing subtitles manually is painstaking, and AI-generated voiceover that skips a sync pass often drifts out of step with the video. This guide covers building a voiceover and subtitle-sync workflow in Floniks: generating AI voiceover from a script, aligning the audio track to the video timeline, automatically transcribing and time-coding the subtitle segments, applying styled subtitle overlays, and exporting both a burned-in subtitle version and a separate SRT file for platform upload.
Why Voiceover and Subtitle Production Needs a Unified Workflow
Voiceover and subtitle production are typically treated as sequential manual steps: record or generate the audio, align it to the video cut, send the audio file to a transcription service, receive the transcript, manually adjust timecodes, format the SRT file, and upload it to the platform. Each handoff between steps is an opportunity for timing drift, transcription errors, or subtitle formatting inconsistencies. For a team producing multiple videos per week, this manual chain consumes hours that should be spent on creative work.
The case for unifying the workflow is straightforward: if the voiceover is AI-generated from a script, the workflow already has the exact text that is spoken, so transcription is unnecessary. What it still needs is the precise timing of each word in the generated audio, so that subtitle segments can be constructed with accurate in-points and out-points. Many AI audio generation models can emit word-level timing data alongside the audio output; a workflow that captures and uses this timing data can generate a complete, time-accurate SRT file without any transcription step at all.
For videos that use human-recorded voiceover rather than AI-generated audio, the workflow includes a Transcription node that converts the audio to time-coded text. The output of the Transcription node is equivalent to the timing data produced by the AI generation node — both produce the same {word, start_time, end_time} data structure that the subtitle segmentation logic consumes. The rest of the workflow is identical regardless of whether the audio was generated or recorded.
Generating AI Voiceover from Script
The workflow starts with a Script Input node that accepts the voiceover script as plain text. The script should be written with natural spoken cadence in mind — shorter sentences, minimal technical jargon unless the audience is specialized, and explicit pause markers (a double line break or a [PAUSE] tag) where the speaker should breathe or where the video requires silence for a visual beat.
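As a rough illustration of how the pause markers might be interpreted before generation, the Python sketch below splits a script into spoken chunks at [PAUSE] tags and double line breaks. The function name `split_script` is hypothetical and not part of Floniks; how the Script Input node actually handles these markers is not specified in this guide.

```python
import re

PAUSE_TAG = "[PAUSE]"

def split_script(script: str) -> list[str]:
    """Split a script into spoken chunks at [PAUSE] tags and blank lines."""
    # Treat a double line break the same way as an explicit [PAUSE] tag.
    normalized = re.sub(r"\n\s*\n", f" {PAUSE_TAG} ", script)
    # Collapse leftover whitespace inside each chunk and drop empty chunks.
    chunks = [" ".join(part.split()) for part in normalized.split(PAUSE_TAG)]
    return [chunk for chunk in chunks if chunk]

script = "Welcome to the demo. [PAUSE] Here is the dashboard.\n\nNow open settings."
chunks = split_script(script)
for chunk in chunks:
    print(chunk)
```

Previewing the chunks this way makes it easy to spot a missing or misplaced pause before spending credits on a generation run.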
From the Script Input, the text passes to a Voiceover Generation node. Configure the voice character parameters: gender, age register, accent if required, speaking pace (words per minute — 140 to 160 is standard for informational content, 120 to 130 for deliberate instructional content), and emotional tone. The tone descriptor is the most impactful parameter: "warm and authoritative" produces a different audio character than "energetic and direct" or "calm and reassuring." Test two or three tone descriptors at a short sample length (the first 30 seconds of the script) before committing to a full-length generation.
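The pace setting also gives a quick way to estimate running time and to cut a 30-second test sample. This is a minimal sketch, assuming duration is simply word count divided by pace; it ignores pauses and the model's actual delivery, and both helper names are hypothetical.

```python
def estimated_duration_seconds(script: str, words_per_minute: float) -> float:
    """Estimate spoken duration from word count and speaking pace."""
    word_count = len(script.split())
    return word_count / words_per_minute * 60.0

def sample_text(script: str, words_per_minute: float, sample_seconds: float = 30.0) -> str:
    """Return roughly the first `sample_seconds` of the script at the given pace."""
    words = script.split()
    sample_word_count = int(words_per_minute * sample_seconds / 60.0)
    return " ".join(words[:sample_word_count])

# A placeholder 300-word script, just to show the arithmetic.
script = " ".join(["word"] * 300)
print(estimated_duration_seconds(script, 150))   # 300 words at 150 wpm
print(len(sample_text(script, 150).split()))     # words in a 30-second sample
```

At 150 words per minute a 300-word script runs about two minutes, and a 30-second sample covers roughly the first 75 words.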
The Voiceover Generation node produces two outputs: the audio file (WAV or MP3) and a word-timing data file (JSON) that contains the start and end timestamps, in milliseconds, for every word in the generated audio. Both outputs are required for the subtitle sync stage. If the audio generation model does not natively produce word-timing data, connect the audio output to a Force Alignment node, which takes the script text and the audio file and computes word-level timing through phoneme alignment. Forced alignment is slightly less accurate than native timing data, but it typically places word boundaries within 50 to 100 milliseconds of their true position, an offset that is imperceptible in normal viewing.
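To make the word-timing file concrete, here is a hedged sketch of loading and sanity-checking it in Python. The field names `word`, `start_time`, and `end_time` follow the structure described earlier; the assumption that the file is a flat JSON list, along with the `WordTiming` class and the validation rules, is illustrative rather than a documented Floniks format.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class WordTiming:
    word: str
    start_ms: int
    end_ms: int

def load_word_timings(raw_json: str) -> list[WordTiming]:
    """Parse word-timing JSON (assumed: a flat list of records, times in ms)."""
    records = json.loads(raw_json)
    timings = [
        WordTiming(r["word"], int(r["start_time"]), int(r["end_time"]))
        for r in records
    ]
    for timing in timings:
        if timing.end_ms < timing.start_ms:
            raise ValueError(f"word {timing.word!r} ends before it starts")
    # Words should arrive in spoken order, so start times must not go backwards.
    for previous, current in zip(timings, timings[1:]):
        if current.start_ms < previous.start_ms:
            raise ValueError(f"word {current.word!r} is out of order")
    return timings

raw = """[
  {"word": "Open", "start_time": 0, "end_time": 300},
  {"word": "the", "start_time": 320, "end_time": 450},
  {"word": "panel.", "start_time": 470, "end_time": 1000}
]"""
timings = load_word_timings(raw)
print(timings[0])
```

Because recorded voiceover run through a Transcription node yields the same structure, a loader like this could serve both paths.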
Audio-to-Video Alignment and Timing Calibration
Before subtitles can be placed, the voiceover audio must be correctly aligned to the video timeline. The alignment requirements depend on the type of video. For a video produced specifically to match an existing script (a tutorial, an explainer, a narrated product showcase), the audio is placed at the video start and runs continuously — alignment is trivial. For a video where the voiceover is added to pre-existing footage that was cut without a script in mind, there may be moments where the audio pace and the visual cut pace diverge, requiring pacing adjustments.
In Floniks, the Audio Alignment node accepts the voiceover audio, the video file, and an alignment specification. For simple linear alignment, set the start offset to 0ms. For videos with deliberate silence zones (a title card at the start, a call-to-action pause at the end), set the voiceover start offset to the end of the opening silence and the voiceover end to the start of the closing sequence.
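When the voiceover starts after an opening silence, the word timings have to move with it. The sketch below shifts every word by the start offset and checks that the narration ends before the closing sequence begins. It is only an illustration: `place_voiceover` and its parameters are hypothetical names, and words are represented as plain `(word, start_ms, end_ms)` tuples.

```python
Word = tuple[str, int, int]  # (word, start_ms, end_ms)

def place_voiceover(
    timings: list[Word],
    opening_silence_end_ms: int,
    closing_start_ms: int,
) -> list[Word]:
    """Shift word timings onto the video timeline and check they fit."""
    shifted = [
        (word, start + opening_silence_end_ms, end + opening_silence_end_ms)
        for word, start, end in timings
    ]
    if shifted and shifted[-1][2] > closing_start_ms:
        raise ValueError("voiceover overruns the start of the closing sequence")
    return shifted

timings = [("Hi", 0, 300), ("there", 350, 700)]
# 2-second title card at the start, closing sequence begins at 10 seconds.
placed = place_voiceover(timings, opening_silence_end_ms=2000, closing_start_ms=10000)
print(placed)
```

For simple linear alignment the offset is 0 and the timings pass through unchanged.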
For videos where the narration must match specific visual events, such as a product being shown at exactly the moment it is mentioned in the script, use the Timing Mark system. In the script, annotate the moment with a timing mark, for example [MARK:product-reveal]. In the video timeline, place a corresponding mark at the frame where the product appears. The Audio Alignment node stretches or compresses the audio segment between consecutive timing marks so that each script mark lands on its video timeline mark. This time-stretching uses a phase-vocoder algorithm that preserves pitch while adjusting pace, so voices do not sound artificially sped up or slowed down. The maximum stretch or compression without audible artifacts is approximately 15% in either direction; plan the script pacing to stay within this range.
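The 15% limit can be checked before generation by comparing the spacing of marks in the audio with their spacing in the video. The following sketch computes a stretch factor for each interval between consecutive marks and remaps a word time piecewise-linearly onto the video timeline. It assumes both mark lists are sorted and have the same length; the function names are hypothetical, and the remapping step is an assumption about how timings should follow the stretched audio rather than a description of the Audio Alignment node's internals.

```python
from bisect import bisect_right

MAX_STRETCH = 0.15  # approximate limit in either direction, per the guide

def stretch_factors(audio_marks_ms: list[int], video_marks_ms: list[int]) -> list[float]:
    """Video duration divided by audio duration for each mark-to-mark interval."""
    factors = []
    for i in range(len(audio_marks_ms) - 1):
        audio_span = audio_marks_ms[i + 1] - audio_marks_ms[i]
        video_span = video_marks_ms[i + 1] - video_marks_ms[i]
        factors.append(video_span / audio_span)
    return factors

def within_limit(factor: float, limit: float = MAX_STRETCH) -> bool:
    """True if the interval can be stretched or compressed within the limit."""
    return abs(factor - 1.0) <= limit

def remap_time(t_ms: float, audio_marks_ms: list[int], video_marks_ms: list[int]) -> float:
    """Map a time in the original audio onto the video timeline."""
    # Find the interval containing t_ms, clamped to the first and last interval.
    i = min(max(bisect_right(audio_marks_ms, t_ms) - 1, 0), len(audio_marks_ms) - 2)
    a0, a1 = audio_marks_ms[i], audio_marks_ms[i + 1]
    v0, v1 = video_marks_ms[i], video_marks_ms[i + 1]
    return v0 + (t_ms - a0) * (v1 - v0) / (a1 - a0)

audio_marks = [0, 4000, 10000]   # where the marks fall in the generated audio
video_marks = [0, 4400, 9700]    # where the matching marks sit in the video
factors = stretch_factors(audio_marks, video_marks)
print([within_limit(f) for f in factors])
print(remap_time(2000, audio_marks, video_marks))
```

An interval that fails the check is a signal to rewrite that part of the script or regenerate at a different pace rather than force the stretch.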
Subtitle Segmentation, Styling, and Export
After the audio is aligned to the video timeline, the Subtitle Segmentation node receives the word-timing data and groups words into subtitle segments. The segmentation rules are configurable: maximum characters per segment (typically 42 for single-line or 84 for two-line), maximum segment duration (typically 3,000 to 4,500 ms, to ensure readability at normal viewing speed), and minimum gap between segments (typically 80 ms, so that consecutive subtitles do not appear to flicker into one another). The segmentation algorithm respects sentence boundaries: a segment will never break mid-sentence if the sentence fits within the character limit.
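The segmentation rules can be sketched as a greedy pass over the word timings. This is a simplified illustration, not the Subtitle Segmentation node's actual algorithm: it closes a segment at every sentence-ending word, starts a new one whenever the character or duration limit would be exceeded, and then trims segment end times to enforce the minimum gap. The names and the `(word, start_ms, end_ms)` tuple layout are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    start_ms: int
    end_ms: int

def segment_words(words, max_chars=42, max_duration_ms=4500, min_gap_ms=80):
    """Group (word, start_ms, end_ms) tuples into subtitle segments."""
    segments: list[Segment] = []
    current: list[tuple[str, int, int]] = []

    def flush() -> None:
        if current:
            text = " ".join(w for w, _, _ in current)
            segments.append(Segment(text, current[0][1], current[-1][2]))
            current.clear()

    for word, start, end in words:
        if current:
            length_with_word = len(" ".join(w for w, _, _ in current)) + 1 + len(word)
            too_long = length_with_word > max_chars
            too_slow = end - current[0][1] > max_duration_ms
            if too_long or too_slow:
                flush()
        current.append((word, start, end))
        if word.endswith((".", "?", "!")):
            flush()  # close the segment at a sentence boundary
    flush()

    # Pull each segment's end time back so the next one starts after the gap.
    for previous, following in zip(segments, segments[1:]):
        if following.start_ms - previous.end_ms < min_gap_ms:
            previous.end_ms = max(previous.start_ms, following.start_ms - min_gap_ms)
    return segments

words = [
    ("Open", 0, 300), ("the", 320, 450), ("settings", 470, 1000), ("panel.", 1020, 1500),
    ("Then", 1520, 1800), ("choose", 1820, 2200), ("a", 2220, 2300), ("theme.", 2320, 2900),
]
segments = segment_words(words)
for segment in segments:
    print(segment)
```

In this example the two sentences become two segments, and the first segment's out-point is trimmed from 1,500 ms to 1,440 ms to leave an 80 ms gap before the second.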
Each subtitle segment is stored as an SRT record with a sequence number, in-point timestamp, out-point timestamp, and the subtitle text. The complete SRT file is exported directly and can be uploaded to any platform that accepts external subtitle files, such as YouTube, LinkedIn, or Vimeo. For a custom video player that expects WebVTT rather than SRT, the workflow also includes a VTT Converter node that produces WebVTT output.
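For reference, the SRT record layout and the SRT-to-WebVTT difference are small enough to show in full. The sketch below is a minimal illustration of the two file formats, not the implementation of the export or VTT Converter nodes; segments are assumed to be `(text, start_ms, end_ms)` tuples and the function names are hypothetical.

```python
def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, remainder = divmod(ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def to_srt(segments) -> str:
    """Build SRT text: sequence number, time range, subtitle text, blank line."""
    blocks = []
    for index, (text, start_ms, end_ms) in enumerate(segments, start=1):
        time_range = f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}"
        blocks.append(f"{index}\n{time_range}\n{text}")
    return "\n\n".join(blocks) + "\n"

def to_vtt(segments) -> str:
    """Build WebVTT text: a WEBVTT header and dot-separated milliseconds."""
    lines = ["WEBVTT", ""]
    for text, start_ms, end_ms in segments:
        start = srt_timestamp(start_ms).replace(",", ".")
        end = srt_timestamp(end_ms).replace(",", ".")
        lines.extend([f"{start} --> {end}", text, ""])
    return "\n".join(lines)

segments = [
    ("Open the settings panel.", 0, 1440),
    ("Then choose a theme.", 1520, 2900),
]
print(to_srt(segments))
print(to_vtt(segments))
```

The two formats differ mainly in the `WEBVTT` header line and in using a dot instead of a comma before the milliseconds; sequence numbers are optional in WebVTT and are omitted here.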
Burned-in subtitles are required for platforms that do not support external subtitle files, and for video content that must remain accessible when downloaded. For these cases, a Subtitle Overlay node applies styled text captions directly to the video frames. The style configuration draws from the Brand Config node (font family, size, color, background pill opacity) so that burned-in captions match the brand's visual identity. The Subtitle Overlay node produces an MP4 output with permanently embedded captions. Both the clean video-with-SRT package and the burned-in caption version are produced in a single workflow run, so every distribution context is covered without running the workflow twice.
Step by step
1. Add a Script Input node and configure the Voiceover Generation node
Navigate to /editor and create a new workflow. Add a Script Input node and paste your voiceover text. Mark deliberate pauses with double line breaks or [PAUSE] tags. Connect the Script Input to a Voiceover Generation node. Set voice character, speaking pace in words per minute, and the tone descriptor. Generate a 30-second sample from the first section of your script and review before proceeding to full generation.
2. Capture word-timing data and align the audio to the video
After full-length generation, confirm the Voiceover Generation node is outputting both the audio file and the word-timing JSON. If your model does not produce timing data natively, connect the audio output to a Force Alignment node along with the script text. Then connect the aligned audio and timing data to an Audio Alignment node with the video file. Set the start offset and any timing marks that must match specific visual events.
3. Run subtitle segmentation and review the SRT output
Connect the word-timing data output from the alignment node to a Subtitle Segmentation node. Configure the maximum characters per line, maximum segment duration, and minimum inter-segment gap. Run the segmentation and download the SRT file. Play the video with the SRT loaded to verify timing accuracy. Adjust segmentation parameters if any segments feel too short to read or linger too long after the audio has moved on.
4. Export burned-in captions and SRT file in a single run
Add a Subtitle Overlay node after the Subtitle Segmentation node. Connect the styled caption parameters from your Brand Config and the aligned video to produce the burned-in MP4 output. In parallel, connect the SRT output to a VTT Converter node if WebVTT format is also required. The export node delivers both the captioned MP4 and the SRT/VTT files. Save the complete workflow as a template for reuse on future videos.