Code

MinHash Similarity Sketch for a Job-Board Deduplication Pipeline

Free · Verified credential · 3 weeks · Advanced

Overview

What this challenge is about.

Implement a MinHash signature generator (128 permutations) over shingled job-posting text, using 5-gram word shingles. Build an LSH banding index of 16 bands with 8 hashes each (16 × 8 = 128), tuned for a Jaccard similarity threshold of 0.7. Run it on the provided 500k-posting labeled sample and measure precision, recall, and F1 at thresholds of 0.6, 0.7, and 0.8. Compare against an exact all-pairs O(n²) baseline on a 10k sample. Deliver a Python reference implementation, a 5-page precision/recall report, and a recommendation on whether to ship the sketch as the deduplication stage.
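To make the shape of the deliverable concrete, here is a minimal sketch of the signature and banding steps. It is not the reference implementation. It assumes whitespace tokenization, a BLAKE2b-based stable hash, and universal hashing of the form (a·x + b) mod p as a stand-in for true permutations. The function names, the seed, and the choice of prime are all illustrative.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

NUM_PERM = 128          # signature length from the brief
BANDS, ROWS = 16, 8     # 16 bands x 8 rows = 128
PRIME = (1 << 61) - 1   # Mersenne prime used as modulus (an assumption)


def shingles(text, k=5):
    """Set of k-word shingles; lowercased, whitespace-split (assumed tokenizer)."""
    words = text.lower().split()
    if len(words) < k:
        # Short postings collapse to a single shingle (or none if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _stable_hash(shingle):
    """64-bit hash that is stable across runs (built-in hash() is salted)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(n=NUM_PERM, seed=42):
    """Draw n (a, b) pairs for the hash family h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, params):
    """One minimum per hash function; empty sets get a sentinel signature."""
    hashed = [_stable_hash(s) for s in shingle_set]
    if not hashed:
        return [PRIME] * len(params)
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def lsh_candidates(signatures, bands=BANDS, rows=ROWS):
    """Pairs of ids whose signatures agree on every row of at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

A typical run computes `make_hash_params()` once, builds a `{posting_id: signature}` dictionary, and passes it to `lsh_candidates`. The candidate pairs would then normally be verified, either by comparing signatures or by exact Jaccard, before being reported as duplicates.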

Credential: Blockchain-anchored
Shareable: LinkedIn-ready
Language: English
Pace: Self-paced

The Brief

What you'll do, and what you'll demonstrate.

Build a MinHash + LSH sketch that finds near-duplicate job postings at a Jaccard threshold of 0.7 with at least 95 percent recall and at least 85 percent precision.
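The recall and precision targets can be checked by comparing the sketch's predicted pairs against ground-truth pairs, taken either from the labels or from an exact all-pairs pass on the smaller sample. The sketch below is one way to do that, under two assumptions: pairs are stored as sorted `(id, id)` tuples, and an empty prediction or truth set scores 1.0. That empty-set convention is a choice, not something the brief specifies.

```python
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_pairs(shingle_sets, threshold):
    """O(n^2) baseline: every pair of ids with Jaccard >= threshold."""
    return {
        (i, j)
        for i, j in combinations(sorted(shingle_sets), 2)
        if jaccard(shingle_sets[i], shingle_sets[j]) >= threshold
    }


def precision_recall_f1(predicted, truth):
    """Pair-level precision, recall, and F1 for two sets of id pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 1.0  # assumed convention
    recall = tp / len(truth) if truth else 1.0             # assumed convention
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

The all-pairs baseline only makes sense on the 10k sample, which already means roughly 50 million comparisons. On the 500k sample the labels would have to serve as ground truth.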

Earning criteria — what you'll demonstrate

  • Derive the relationship between LSH bands (b), rows per band (r), and the S-curve probability of collision
  • Implement MinHash with permutation-based hashing without introducing bias into the similarity estimate
  • Measure approximate-vs-exact deduplication quality with labeled data
  • Reason about the recall/precision trade-off controlled by b and r
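For the first and last criteria, the standard banding result is that a pair with Jaccard similarity s agrees on all r rows of a given band with probability s^r, so it becomes a candidate in at least one of b bands with probability 1 − (1 − s^r)^b. The steep part of that S-curve sits near (1/b)^(1/r). A small sketch for exploring it follows; the function names are illustrative and the defaults are the brief's b = 16 and r = 8.

```python
def candidate_probability(s, bands=16, rows=8):
    """P(pair with Jaccard s shares at least one band) = 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands=16, rows=8):
    """Common rule of thumb for where the S-curve rises: (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


if __name__ == "__main__":
    print(f"approximate threshold: {approx_threshold():.3f}")
    for s in (0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"s={s:.1f}  P(candidate)={candidate_probability(s):.3f}")
```

With b = 16 and r = 8 the rule-of-thumb threshold is about 0.707, which matches the 0.7 target. The same formula puts the candidate probability at roughly 0.61 for a pair at exactly s = 0.7 and roughly 0.95 at s = 0.8. Measured recall at a 0.7 cut-off will therefore depend on how the true duplicates are distributed above the threshold. Increasing b or decreasing r shifts the curve left, trading precision for recall.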

Program Fit

Where this fits in your program.

Sharpens the same skills your degree expects you to demonstrate.

Skills

Skills you'll demonstrate.

Each one shows up on your verified credential.

Careers

Roles this prepares you for.

Real titles. Real skill bridges. Pick the one closest to your trajectory.

Career mappings coming soon.

One more thing

You can put a credential on your CV by Friday.