MinHash Similarity Sketch for a Job-Board Deduplication Pipeline
Overview
What this challenge is about.
Implement a MinHash signature generator (128 permutations) over shingled job-posting text (5-gram word shingles). Build an LSH banding index (16 bands of 8 hashes each) tuned for a Jaccard similarity threshold of 0.7. Run on a 500k-posting labeled sample (provided), measure precision, recall, and F1 at thresholds 0.6 / 0.7 / 0.8. Compare against an O(n squared) baseline on a 10k sample. Deliver a Python reference implementation, a 5-page precision/recall report, and a recommendation on whether to ship the sketch as the deduplication stage.
The Brief
What you'll do, and what you'll demonstrate.
Build a MinHash + LSH sketch that finds near-duplicate job postings at a Jaccard threshold of 0.7 with at least 95 percent recall and at least 85 percent precision.
Earning criteria — what you'll demonstrate
- Derive the relationship between LSH bands (b), rows per band (r), and the S-curve probability of collision
- Implement MinHash with permutation-based hashing without leaking bias
- Measure approximate-vs-exact deduplication quality with labeled data
- Reason about the recall/precision trade controlled by b and r
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career mappings coming soon.