A Dataset-Prep and Cleanup Workflow

Updated 2026-06-19 · 12 min read
Key takeaway

Training a custom AI model or fine-tuning an existing one requires a dataset that is clean, consistently formatted, and appropriately diverse. Preparing raw image collections for training means deduplicating near-identical samples, normalizing resolutions, removing low-quality or corrupted files, augmenting underrepresented categories, and generating training captions. Done manually, this work is time-consuming and becomes inconsistent at scale. This guide explains how to build a dataset-prep and cleanup workflow in Floniks that automates each stage of dataset curation, from raw input collection to a quality-filtered, captioned, export-ready training set.

Why Dataset Quality Determines Model Quality

The performance ceiling of any fine-tuned or custom-trained AI model is set by the quality of its training dataset. A dataset that contains duplicate samples looks larger and more diverse than it really is, and teaches the model to overweight the repeated pattern. A dataset with inconsistent resolutions causes the model to learn resolution-specific artifacts rather than content-general features. A dataset that underrepresents certain categories causes the model to perform poorly on those categories regardless of how well it performs on the overrepresented ones. A dataset with low-quality, blurry, watermarked, or corrupted images teaches the model to expect and reproduce those flaws.

These problems are easy to introduce accidentally in large image collections assembled from multiple sources: scraping, licensed stock purchases, internal asset libraries, or generated image batches. Each source may have different resolution standards, different quality levels, different compression artifacts, and different metadata completeness. Without a systematic cleanup workflow, these inconsistencies are fed directly into the training process and emerge as capability limitations in the trained model.

The dataset-prep workflow treats dataset curation as a production engineering problem rather than a manual review task. Each curation concern — deduplication, resolution normalization, quality filtering, category balance analysis, caption generation — is handled by a dedicated node with explicit parameters. The workflow is repeatable, auditable, and can be run on any new batch of raw images before they are added to an existing dataset, making ongoing dataset maintenance as systematic as the initial curation.

Deduplication and Near-Duplicate Detection

Exact duplicates (identical image files) are trivially detected by file hash comparison. The more problematic case is near-duplicates: images that are very similar but not identical because of minor crop differences, slight color adjustments, different compression levels applied to the same source, or generated images produced with very similar prompts and seeds. Near-duplicates dilute dataset diversity while consuming training budget and compute.

In Floniks, connect the raw image collection to a Deduplication node. This node computes a perceptual hash for each image — a compact representation of the visual content that is invariant to minor compression changes but sensitive to substantive content differences. Images with perceptual hash similarity above the configured threshold (default 0.92, range 0.85–0.98) are grouped as near-duplicate clusters. The node then selects one representative from each cluster (typically the highest-quality member by sharpness score) and routes the rest to a quarantine folder for review.
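The clustering step described above can be illustrated outside Floniks. The sketch below is not the Deduplication node's implementation, which the guide does not specify. It assumes a difference hash (dHash) computed on a small grayscale grid, defines similarity as the fraction of matching hash bits, and uses hypothetical function names throughout.

```python
from typing import Dict, List, Sequence, Tuple

Gray = Sequence[Sequence[int]]  # rows of 0-255 grayscale values, already downscaled


def dhash(pixels: Gray) -> List[int]:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    return [1 if row[x] > row[x + 1] else 0
            for row in pixels for x in range(len(row) - 1)]


def similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching bits (1.0 means identical hashes)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def cluster_near_duplicates(hashes: Dict[str, List[int]],
                            threshold: float = 0.92) -> List[List[str]]:
    """Greedy clustering: an image joins the first cluster whose first member
    is at least `threshold` similar (assumed rule), else starts a new cluster."""
    clusters: List[List[str]] = []
    for name, h in hashes.items():
        for cluster in clusters:
            if similarity(hashes[cluster[0]], h) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def split_keep_quarantine(clusters: List[List[str]],
                          sharpness: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Keep the sharpest member of each cluster and quarantine the rest."""
    keep: List[str] = []
    quarantine: List[str] = []
    for cluster in clusters:
        best = max(cluster, key=lambda n: sharpness[n])
        keep.append(best)
        quarantine.extend(n for n in cluster if n != best)
    return keep, quarantine
```

Greedy clustering against the first member of each cluster is a simplification chosen for brevity. A production pipeline would more likely use an indexed nearest-neighbor search so that large collections are not compared pair by pair.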

A lower threshold deduplicates more aggressively, because more image pairs qualify as near-duplicates. Use a lower value (around 0.90) for datasets where diversity is critical, and a higher value (0.95) for datasets where some controlled repetition of a specific concept is intentional. After automated deduplication, the quarantine folder is worth a manual review pass: occasionally the hash comparison places legitimately distinct images in the same cluster because they share a dominant visual structure (two different product shots on a white background may hash similarly despite showing different products). Reviewing the quarantine clusters takes a fraction of the time needed to inspect the full raw collection manually.

Quality Filtering and Resolution Normalization

After deduplication, route all surviving images through a Quality Filter node. This node measures three quality signals for each image: sharpness (using a Laplacian variance metric — low values indicate blur), compression artifact presence (blocking, mosquito noise, and ringing artifacts visible at the pixel level), and watermark or overlay detection (identifying text overlays, branded watermarks, or UI elements that would teach the model to reproduce them).

Set the minimum sharpness threshold based on your model application: for fine-detail generation tasks such as texture synthesis or face generation, use a higher minimum sharpness (Laplacian variance above 150). For general scene generation where fine detail is less critical, a lower threshold (above 80) is appropriate. Images below the minimum are routed to a review folder rather than deleted outright — some soft images may be valid soft-focus artistic choices rather than focus failures, and deserve a quick human check before exclusion.
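For readers who want to see what a Laplacian variance score measures, here is a minimal pure-Python sketch. It assumes a 4-neighbour Laplacian kernel applied to a grayscale grid. The Quality Filter node's actual kernel and image scaling are not documented in this guide, so absolute scores from this sketch may not be comparable with the 150 and 80 thresholds above. Function names are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale values


def laplacian_variance(pixels: Gray) -> float:
    """Variance of the 4-neighbour Laplacian response over interior pixels.

    Sharp edges give large positive and negative responses, so the variance
    is high; blurred images give responses near zero, so it is low.
    """
    height, width = len(pixels), len(pixels[0])
    responses: List[float] = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                pixels[y - 1][x] + pixels[y + 1][x]
                + pixels[y][x - 1] + pixels[y][x + 1]
                - 4 * pixels[y][x]
            )
    return pvariance(responses)


def route_by_sharpness(score: float, minimum: float = 80.0) -> str:
    """Send images at or below the minimum to human review, never to deletion."""
    return "pass" if score > minimum else "review"
```

Routing to "review" rather than discarding mirrors the guidance above: a soft image may be a deliberate artistic choice, so the final decision stays with a person.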

After quality filtering, route surviving images through a Resolution Normalization node. This node scales all images to a target resolution (commonly 512x512, 768x768, or 1024x1024 depending on the training pipeline) using a high-quality resizing algorithm that preserves sharpness. For images with aspect ratios that differ from the target square, the node applies your chosen handling strategy: center-crop (removes edge content to fit the square), padding (adds neutral fill to reach the square without cropping), or letterbox-blur (fits the whole image inside the square and fills the remaining padding region with a blurred, scaled copy of the same image). Document which strategy you use in the dataset metadata, because each strategy affects how the model learns compositional conventions from the training data.
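The geometry behind the center-crop and padding strategies is simple arithmetic. The sketch below only computes the crop box and the padded layout, with no image library involved. The function names and the rounding choices are assumptions made for illustration, not the Resolution Normalization node's behavior.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom


def center_crop_box(width: int, height: int) -> Box:
    """Largest centered square inside the source image."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def pad_layout(width: int, height: int,
               target: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Scaled size and top-left paste offset for fitting inside a target square.

    The longer edge is scaled to `target`; the shorter edge is centered and
    the leftover area is what padding (or a blurred copy) has to fill.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    offset = ((target - new_w) // 2, (target - new_h) // 2)
    return (new_w, new_h), offset


# Example: a 1920x1080 frame prepared for a 1024x1024 training resolution.
crop = center_crop_box(1920, 1080)
size, offset = pad_layout(1920, 1080, 1024)
```

Center-crop discards 420 pixels from each side of a 1920x1080 frame before scaling, while padding keeps the full frame and leaves bands above and below it. That difference is the reason to record the chosen strategy in the dataset metadata.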

Category Balance Analysis and Augmentation

A balanced dataset represents every category the model is expected to handle well in roughly the proportions you are targeting. If a product photography dataset contains 800 images of the hero product SKU and 50 images each of the secondary SKUs, a model trained on it will generate the hero SKU far more reliably than the others. Identifying and correcting this imbalance before training is more efficient than iterating on the model after the fact.

In Floniks, connect the filtered and normalized image collection to a Category Analysis node. This node classifies each image by configurable category labels — for a product dataset this might be SKU type, color, angle, and background; for a character dataset it might be pose category, expression, and clothing type. It outputs a distribution table showing the count per category combination. Review this table to identify underrepresented categories.
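A distribution table of this kind is easy to reproduce from exported labels. The sketch below assumes each image already has a label record (a plain dictionary) and counts category combinations with `collections.Counter`. The record layout, the function names, and the `min_share` cutoff are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def distribution_table(records: Iterable[Dict[str, str]],
                       keys: List[str]) -> Counter:
    """Count images per combination of the chosen label keys."""
    return Counter(tuple(record[k] for k in keys) for record in records)


def underrepresented(table: Counter, min_share: float) -> List[Tuple[str, ...]]:
    """Combinations whose share of the dataset falls below min_share."""
    total = sum(table.values())
    return sorted(combo for combo, n in table.items() if n / total < min_share)


# Example label records, as a classifier might emit them.
records = (
    [{"sku": "hero", "angle": "front"}] * 8
    + [{"sku": "secondary", "angle": "front"}]
    + [{"sku": "secondary", "angle": "side"}]
)
table = distribution_table(records, keys=["sku", "angle"])
gaps = underrepresented(table, min_share=0.2)
```

Counting combinations rather than single labels matters: a dataset can look balanced per SKU and per angle separately while still missing specific SKU and angle pairs.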

For categories that are underrepresented relative to your target balance, add an Augmentation node in the workflow that generates additional samples for those categories. Augmentation can take two forms: transformation augmentation (horizontal flip, slight rotation, color jitter, and cropping variations applied to existing images), which increases apparent diversity without new generation, or generative augmentation (using a generation node with category-specific prompts to create genuinely new samples that match the visual style of the existing dataset). Transformation augmentation is fast and uses no generation credits; generative augmentation adds true diversity but requires generation credits. For imbalance ratios up to about 3:1, transformation augmentation is usually sufficient. For more severe imbalances (about 5:1 and above, and especially 10:1 or greater), generative augmentation is recommended so that the underrepresented category is genuinely well covered.
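The ratio-based rule of thumb can be written down as a small planning function. This is a sketch under stated assumptions: the ratio is taken as the largest category count divided by each category's count, the 3:1 cutoff comes from this guide, and the generative cutoff is a parameter because the guide mentions both 5:1 (in the FAQ) and 10:1. The `balanced_max` tolerance and all names are hypothetical.

```python
from typing import Dict


def imbalance_ratio(counts: Dict[str, int], category: str) -> float:
    """Largest category count divided by this category's count."""
    if counts[category] == 0:
        return float("inf")
    return max(counts.values()) / counts[category]


def augmentation_plan(counts: Dict[str, int],
                      balanced_max: float = 1.2,    # hypothetical tolerance
                      transform_max: float = 3.0,   # "up to 3:1" in the guide
                      generative_min: float = 5.0   # guide cites 5:1 and 10:1
                      ) -> Dict[str, str]:
    """Suggest an augmentation strategy per category from its imbalance ratio."""
    plan: Dict[str, str] = {}
    for category in counts:
        ratio = imbalance_ratio(counts, category)
        if ratio <= balanced_max:
            plan[category] = "none"
        elif ratio <= transform_max:
            plan[category] = "transformation"
        elif ratio >= generative_min:
            plan[category] = "generative"
        else:
            plan[category] = "review"  # between the two cutoffs: judgment call
    return plan


# Example: the 800 versus 50 hero-SKU imbalance, plus two milder cases.
counts = {"hero": 800, "sku_b": 50, "sku_c": 300, "sku_d": 200}
plan = augmentation_plan(counts)
```

Categories that land between the two cutoffs are marked for review rather than forced into either strategy, since the guide gives no firm rule for that range.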

Generating Training Captions and Exporting the Dataset

For fine-tuning text-to-image models, each training image requires a paired text caption that describes the image content accurately. Manual captioning of thousands of images is impractical and introduces inconsistency, because different annotators use different vocabulary for the same concepts. Automated captioning in the workflow produces captions with a consistent vocabulary and level of detail across the entire dataset.

In Floniks, connect the balanced, quality-filtered image collection to a Caption Generation node configured for training caption format. Provide a captioning instruction that matches the training pipeline requirements: "Write a detailed, specific description of the image content. Include subject description, surface materials, lighting conditions, background scene, and any distinctive visual attributes. Use consistent terminology. Do not use metaphors or subjective quality assessments. Maximum 100 words." For CLIP-based training pipelines, lower the maximum word count: the standard CLIP text encoder accepts at most 77 tokens, so longer captions are typically truncated.
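Generated captions can be spot-checked against the instruction before export. The sketch below is a hypothetical post-processing check, not a Floniks feature: it counts words with a simple regular expression and flags a small, illustrative list of subjective terms. The word list and function name are assumptions.

```python
import re
from typing import List

# Hypothetical examples of subjective quality words to flag.
SUBJECTIVE_TERMS = {"beautiful", "stunning", "amazing", "gorgeous", "ugly"}


def caption_issues(caption: str, max_words: int = 100) -> List[str]:
    """Return a list of problems found in one caption (empty list means OK)."""
    words = re.findall(r"[A-Za-z0-9'-]+", caption.lower())
    issues: List[str] = []
    if not words:
        issues.append("empty caption")
    if len(words) > max_words:
        issues.append(f"too long: {len(words)} words (max {max_words})")
    flagged = sorted(SUBJECTIVE_TERMS.intersection(words))
    if flagged:
        issues.append("subjective terms: " + ", ".join(flagged))
    return issues
```

Word counts are only a proxy for tokenizer length, so pipelines with a hard token limit would need a lower `max_words` than the limit itself suggests.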

After caption generation, connect to a Dataset Export node that packages the images and their captions in the required format: JSONL with image path and caption per line for most diffusion model pipelines, or individual .txt sidecar files alongside each .jpg for training setups that expect that structure. The export node also generates a dataset manifest — a summary file listing total image count, category distribution, resolution statistics, and captioning completion rate — which serves as the documentation for this version of the dataset. Version the manifest with a date stamp. When the dataset is updated with new images, the manifest history shows exactly when and how the composition of the training set evolved, which is essential context for interpreting changes in model behavior between fine-tuning runs.
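The two export layouts and the manifest can be sketched with the standard library alone. The example below is an illustration, not the Dataset Export node's output format: the file names (`captions.jsonl`, `manifest_<date>.json`), the JSON keys, and the sample record layout are all assumptions, and resolution statistics are left out for brevity.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, List


def export_dataset(samples: List[Dict[str, str]], out_dir: Path,
                   stamp: str, sidecars: bool = False) -> Dict:
    """Write captions as JSONL (and optional .txt sidecars) plus a manifest."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # One JSON object per line: image path and caption.
    with open(out_dir / "captions.jsonl", "w", encoding="utf-8") as f:
        for s in samples:
            f.write(json.dumps({"image": s["image"], "caption": s["caption"]}) + "\n")
            if sidecars and s["caption"].strip():
                # Sidecar shares the image's base name: img_001.jpg -> img_001.txt
                sidecar = out_dir / Path(s["image"]).with_suffix(".txt").name
                sidecar.write_text(s["caption"], encoding="utf-8")

    captioned = sum(1 for s in samples if s["caption"].strip())
    manifest = {
        "date": stamp,
        "total_images": len(samples),
        "category_distribution": dict(Counter(s["category"] for s in samples)),
        "captioning_completion_rate": captioned / len(samples) if samples else 0.0,
    }
    # Date-stamped file name so earlier manifests are never overwritten.
    manifest_path = out_dir / f"manifest_{stamp}.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest
```

Keeping one manifest file per date stamp, instead of overwriting a single file, is what makes the composition history comparable between fine-tuning runs.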

FAQ

What perceptual hash similarity threshold should I use for near-duplicate detection?

Start at 0.92 for most datasets. This threshold catches near-duplicates from the same source image with different crops or compression settings while allowing genuinely distinct images with similar dominant structures to pass through as separate samples. Increase to 0.95 if you intentionally include some controlled variation around key concepts (slight pose variations of the same character, for example). Decrease to 0.88 for datasets where diversity is the primary training signal and even moderate visual similarity between samples is undesirable.

When should I use generative augmentation versus transformation augmentation for imbalanced categories?

Use transformation augmentation (flip, rotation, color jitter, crop variation) for imbalance ratios up to approximately 3:1. These transformations are fast, free, and add apparent diversity without requiring generation. Use generative augmentation for ratios of 5:1 or greater, or when you need the underrepresented category to include genuinely different compositions and scenarios rather than just transformations of a small set of existing samples. Generative augmentation adds real diversity but requires careful prompt specification to ensure the generated additions match the visual style and quality of the existing dataset samples.
