Content Fingerprinting at Scale: How UCFP Detects Near-Duplicate Text and Video
A deep dive into UCFP's canonical and perceptual fingerprinting pipelines — MinHash, SimHash, perceptual hashing, and the match orchestration layer that ties it together.
The problem of detecting duplicate or near-duplicate content is deceptively hard. Exact matching is trivial — hash the content, compare hashes. But content reuse in the real world is messy: paraphrased text, re-encoded video, trimmed clips, normalised whitespace, translated excerpts.
UCFP (Universal Content Fingerprinting Framework) is my attempt to solve this robustly — a multimodal fingerprinting engine in Rust that handles both text and video content with sub-second matching.
The Two Families of Fingerprints
UCFP generates two types of fingerprints for each document or video:
Canonical fingerprints identify content after normalisation. For text, this means Unicode normalisation (NFC/NFD), lowercasing, punctuation stripping, and stopword removal. Two documents with different formatting but the same semantic content will produce the same canonical fingerprint.
Perceptual fingerprints identify content with tolerance for minor modifications. A 30-second clip trimmed from a 5-minute video, or a document with one paragraph added, should still match. This requires algorithms whose outputs stay similar for similar inputs, not merely identical for identical ones.
Text Fingerprinting
Canonical: SHA-256 of Normalised Text
After normalisation, the canonical fingerprint is simply SHA-256 of the token sequence. Exact matches after normalisation are caught here with zero false positives.
fn canonical_text(input: &str) -> [u8; 32] {
    let normalised = normalise(input); // unicode, lowercase, strip punct
    let tokens: Vec<&str> = tokenise(&normalised)
        .filter(|t| !STOPWORDS.contains(t))
        .collect();
    let joined = tokens.join(" ");
    sha256(joined.as_bytes())
}
Perceptual: MinHash + LSH
For near-duplicate detection, I use MinHash over word shingles (n-grams). MinHash compresses each document's shingle set into a short signature from which Jaccard similarity can be estimated, without materialising or intersecting the full sets:
fn minhash(tokens: &[&str], k: usize, num_hashes: usize) -> Vec<u64> {
    let shingles: HashSet<u64> = tokens
        .windows(k)
        .map(|w| xxhash64(w.join(" ").as_bytes()))
        .collect();
    (0..num_hashes)
        .map(|seed| {
            shingles
                .iter()
                .map(|&h| hash_with_seed(h, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}
The resulting MinHash signature (a vector of num_hashes u64 values) is then indexed with Locality-Sensitive Hashing (LSH) for sub-linear query time. Documents with Jaccard similarity above a configurable threshold are candidate matches.
For num_hashes = 128 and a threshold of 0.8, UCFP achieves a false negative rate under 1% on paraphrased news articles.
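The banding step is not shown above, so here is a minimal sketch of how a MinHash signature can be split into LSH bands and bucketed, plus the signature-level Jaccard estimate. The `LshIndex` type, the band count, and the use of `DefaultHasher` are illustrative stand-ins, not UCFP's actual implementation (whose index lives in redb):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hash each band (contiguous slice) of a signature; two signatures that
/// collide in any band become candidate pairs.
fn band_keys(signature: &[u64], bands: usize) -> Vec<(usize, u64)> {
    let rows = signature.len() / bands;
    (0..bands)
        .map(|b| {
            let mut h = DefaultHasher::new();
            signature[b * rows..(b + 1) * rows].hash(&mut h);
            (b, h.finish())
        })
        .collect()
}

/// Toy in-memory LSH index: band key -> document ids.
struct LshIndex {
    bands: usize,
    buckets: HashMap<(usize, u64), Vec<u64>>,
}

impl LshIndex {
    fn new(bands: usize) -> Self {
        Self { bands, buckets: HashMap::new() }
    }

    fn insert(&mut self, doc_id: u64, signature: &[u64]) {
        for key in band_keys(signature, self.bands) {
            self.buckets.entry(key).or_default().push(doc_id);
        }
    }

    /// All documents sharing at least one band with the query signature.
    fn candidates(&self, signature: &[u64]) -> Vec<u64> {
        let mut ids: Vec<u64> = band_keys(signature, self.bands)
            .into_iter()
            .filter_map(|key| self.buckets.get(&key))
            .flatten()
            .copied()
            .collect();
        ids.sort_unstable();
        ids.dedup();
        ids
    }
}

/// Estimate Jaccard similarity as the fraction of matching signature slots.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}
```

With 128 hashes split into 16 bands of 8 rows, a document that differs from the query in only one band still surfaces as a candidate, which is the property the sub-linear query time relies on.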
Semantic: SimHash
For documents that are paraphrased rather than copied, n-gram similarity breaks down. UCFP optionally generates a SimHash over TF-IDF weighted term vectors. SimHash maps a high-dimensional vector to a compact bit fingerprint (64 bits) such that similar vectors produce fingerprints with low Hamming distance.
Hamming distance queries are extremely fast — popcount(a XOR b) compiles down to an XOR followed by a single POPCNT instruction on x86.
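A sketch of the SimHash construction and the Hamming comparison. The `term_hash` helper is a generic stand-in; the real pipeline's term hashing and TF-IDF weighting are not shown here:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in 64-bit term hash (the real pipeline may use something else).
fn term_hash(term: &str) -> u64 {
    let mut h = DefaultHasher::new();
    term.hash(&mut h);
    h.finish()
}

/// SimHash over (term, weight) pairs: accumulate each term's weight into a
/// 64-slot vector (+w where the term hash has a 1 bit, -w where it has a 0),
/// then keep only the sign of each slot as one fingerprint bit.
fn simhash(weighted_terms: &[(&str, f64)]) -> u64 {
    let mut acc = [0f64; 64];
    for &(term, weight) in weighted_terms {
        let h = term_hash(term);
        for bit in 0..64 {
            if (h >> bit) & 1 == 1 {
                acc[bit] += weight;
            } else {
                acc[bit] -= weight;
            }
        }
    }
    let mut fp = 0u64;
    for (bit, &v) in acc.iter().enumerate() {
        if v > 0.0 {
            fp |= 1 << bit;
        }
    }
    fp
}

/// Hamming distance: XOR then population count.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}
```

The key property: a low-weight term added to or dropped from a document perturbs each accumulator slot only slightly, so most sign bits, and therefore most fingerprint bits, survive the edit.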
Video Fingerprinting
Video is more complex because the content can be spatially transformed, temporally shifted, and re-encoded.
Frame-level Perceptual Hashing
For each keyframe, UCFP generates a perceptual hash (pHash) via discrete cosine transform:
fn phash(frame: &GrayImage) -> u64 {
    let resized = resize(frame, 32, 32, FilterType::Lanczos3);
    let dct = compute_dct_2d(&resized); // 32x32 coefficient matrix
    // Keep the top-left 8x8 block of the DCT (low frequencies), row by row
    let low: Vec<f64> = (0..8)
        .flat_map(|row| dct[row][..8].iter().copied())
        .collect();
    let median = median(&low);
    let mut hash = 0u64;
    for (i, &val) in low.iter().enumerate() {
        if val > median {
            hash |= 1 << i;
        }
    }
    hash
}
pHash is robust to:
- Brightness/contrast adjustments
- Minor crops and scaling
- Re-encoding (H.264 → H.265, different bitrates)
- Colour space changes
It is not robust to heavy spatial transformations (flips, rotations) or major temporal resampling. Those require additional processing.
Temporal Fingerprinting
A single-frame pHash isn't enough — a video is a sequence. UCFP builds a temporal signature: a sorted sequence of (timestamp, pHash) pairs, sampled at 1 fps from keyframes.
Matching two temporal signatures is done with dynamic time warping (DTW) to handle clips that are temporally offset or played at different speeds. A match is declared when the DTW distance falls below a configurable threshold.
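A minimal sketch of DTW over two pHash sequences, using per-frame Hamming distance as the local cost. It ignores the timestamps in the temporal signature and applies no warping-window constraint; both the cost function and those simplifications are assumptions, not UCFP's exact matcher:

```rust
/// DTW distance between two pHash sequences. The local cost for aligning a
/// pair of frames is the Hamming distance between their hashes.
fn dtw(a: &[u64], b: &[u64]) -> f64 {
    let (n, m) = (a.len(), b.len());
    // cost[i][j] = best alignment cost of a[..i] against b[..j]
    let mut cost = vec![vec![f64::INFINITY; m + 1]; n + 1];
    cost[0][0] = 0.0;
    for i in 1..=n {
        for j in 1..=m {
            let local = (a[i - 1] ^ b[j - 1]).count_ones() as f64;
            // Extend the cheapest of the three admissible warping moves.
            let prev = cost[i - 1][j]
                .min(cost[i][j - 1])
                .min(cost[i - 1][j - 1]);
            cost[i][j] = local + prev;
        }
    }
    cost[n][m]
}
```

Because the warping path may repeat frames on either side, a clip played at half speed (each frame duplicated) still aligns at zero cost against the original.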
The Match Orchestration Layer
With multiple fingerprint types across multiple modalities, UCFP needs a layer to coordinate queries and aggregate results.
pub struct MatchResult {
    pub candidate_id: DocumentId,
    pub canonical_match: bool,
    pub minhash_similarity: f64,
    pub simhash_hamming: u32,
    pub phash_similarity: f64, // video only
    pub dtw_distance: f64,     // video only
    pub confidence: f64,       // weighted aggregate
    pub false_positive_risk: FpRisk,
}
The confidence score is a weighted combination of the individual similarity signals. Weights are configurable and can be tuned per use case. A canonical match alone is sufficient for confidence = 1.0. MinHash similarity of 0.9 with matching SimHash contributes to a high but not perfect confidence.
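For text signals, the aggregation can be sketched as follows; the 0.6/0.4 weights are hypothetical placeholders, not UCFP's shipped defaults:

```rust
/// Illustrative confidence aggregation over the text signals. A canonical
/// match short-circuits to 1.0; otherwise MinHash and SimHash similarities
/// are blended with (hypothetical) weights.
fn confidence(canonical: bool, minhash_sim: f64, simhash_hamming: u32) -> f64 {
    if canonical {
        return 1.0; // exact match after normalisation: full confidence
    }
    // Map Hamming distance (0..=64 over a 64-bit fingerprint) to [0, 1].
    let simhash_sim = 1.0 - f64::from(simhash_hamming) / 64.0;
    (0.6 * minhash_sim + 0.4 * simhash_sim).clamp(0.0, 1.0)
}
```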
FpRisk classifies the result:
- Low — multiple independent signals agree; very unlikely to be a false positive
- Medium — single-signal match at a borderline threshold
- High — near-threshold match on a weak signal; manual review recommended
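In Rust terms, that classification can be sketched as below; the inputs (a count of agreeing signals and a "near threshold" flag) are stand-ins for whatever per-signal margins UCFP actually tracks:

```rust
/// Risk classes as described above.
enum FpRisk {
    Low,
    Medium,
    High,
}

/// Hypothetical classifier over simplified inputs.
fn classify(agreeing_signals: usize, near_threshold: bool) -> FpRisk {
    match (agreeing_signals, near_threshold) {
        // Two or more independent signals agreeing dominates everything else.
        (n, _) if n >= 2 => FpRisk::Low,
        // A single clean signal match is plausible but unconfirmed.
        (1, false) => FpRisk::Medium,
        // Near-threshold or no solid signal: flag for manual review.
        _ => FpRisk::High,
    }
}
```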
Engineering Lessons
Redb for the index. I evaluated RocksDB, LMDB, and redb (an embedded ACID key-value store written in Rust). Redb won on three axes: compile-time type safety, no C dependency (pure Rust), and a clean API. Write throughput is lower than RocksDB, but for an offline fingerprinting index, that's acceptable.
FFmpeg is a dependency, not a library. Calling FFmpeg via subprocess is messy — parsing stdout, handling error codes, managing temp files. I eventually wrote a thin Rust wrapper around the FFmpeg C API via bindgen. It was painful to set up but much cleaner at runtime.
Threshold selection is a product decision, not an engineering decision. I spent too long trying to find the "right" threshold for near-duplicate detection. The right threshold depends on the application: a copyright enforcement system wants very low false negatives; a recommendation system wants low false positives. UCFP exposes all thresholds as configuration and ships with sensible defaults for both use cases.