Extract Structured Lease Terms for a Commercial Real-Estate Platform
Overview
What this challenge is about.
You receive 500 anonymized lease PDFs and a labelled gold set of 150 leases with all 14 target fields filled in. Build a pipeline with four stages: (1) layout-aware PDF parsing (Unstructured, PyMuPDF, or LayoutParser); (2) field extraction using a hybrid of regex/rules and a small extractive model; (3) per-field confidence scoring; and (4) a Streamlit review tool for low-confidence rows. Evaluate per-field accuracy on a held-out 50-lease test set. Deliverables: the pipeline, the review tool, an evaluation report, and a 3-page deployment recommendation for the platform team.
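One way the hybrid extraction stage might look is sketched below. It is a minimal illustration, not a prescribed design: the field names (`monthly_rent`, `commencement_date`), the regex patterns, the `model_fallback` callable, and the confidence values 0.9 and 0.5 are all assumptions chosen for the example, and the brief does not name the 14 fields. Rules run first; the model is consulted only when no rule matches.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Extraction:
    name: str              # field name
    value: Optional[str]   # extracted text, or None if nothing was found
    confidence: float      # heuristic score in [0, 1], NOT yet calibrated
    source: str            # "rule", "model", or "none"

# Hypothetical patterns for two illustrative fields; real leases need many more variants.
RULES = {
    "monthly_rent": re.compile(
        r"monthly\s+(?:base\s+)?rent\s+(?:of|shall\s+be|is)\s+\$([\d,]+(?:\.\d{2})?)",
        re.IGNORECASE,
    ),
    "commencement_date": re.compile(
        r"commenc(?:e|ing|ement)\s+(?:date\s+)?(?:on|of|is)\s+"
        r"([A-Z][a-z]+\s+\d{1,2},\s+\d{4})",
        re.IGNORECASE,
    ),
}

# Assumed signature for the extractive model: (field, text) -> (value, score).
ModelFallback = Callable[[str, str], Tuple[Optional[str], float]]

def extract_field(name: str, text: str,
                  model_fallback: Optional[ModelFallback] = None) -> Extraction:
    """Try the rule for `name` first; fall back to the model if no rule matches."""
    matches = RULES[name].findall(text)
    distinct = set(matches)
    if len(distinct) == 1:
        # A single consistent rule match: placeholder high confidence.
        return Extraction(name, matches[0], 0.9, "rule")
    if len(distinct) > 1:
        # Conflicting matches: keep the first, but flag it with low confidence.
        return Extraction(name, matches[0], 0.5, "rule")
    if model_fallback is not None:
        value, score = model_fallback(name, text)
        return Extraction(name, value, score, "model")
    return Extraction(name, None, 0.0, "none")
```

For example, `extract_field("monthly_rent", "The monthly base rent shall be $12,500.00.")` yields the value `"12,500.00"` from the rule path. The fixed scores are placeholders; they would need to be replaced by scores calibrated against the gold set before being used for routing.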
The Brief
What you'll do, and what you'll demonstrate.
Automate per-field lease extraction at 95% accuracy with a human-review fallback for low-confidence rows.
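The human-review fallback can be sketched as a simple routing step. This is an assumption-laden illustration: the row layout (dicts with `field` and `confidence` keys), the `route_rows` helper, and the default cutoff of 0.85 are hypothetical and are not specified by the brief.

```python
from typing import Dict, List, Tuple

DEFAULT_THRESHOLD = 0.85  # assumed default; in practice tuned per field on the gold set

Row = Dict[str, object]

def route_rows(rows: List[Row],
               thresholds: Dict[str, float]) -> Tuple[List[Row], List[Row]]:
    """Split extracted rows into (auto_accepted, needs_review).

    A row is auto-accepted when its confidence meets the cutoff for its field;
    fields without an explicit cutoff use DEFAULT_THRESHOLD.
    """
    auto: List[Row] = []
    review: List[Row] = []
    for row in rows:
        cutoff = thresholds.get(str(row["field"]), DEFAULT_THRESHOLD)
        if float(row["confidence"]) >= cutoff:
            auto.append(row)
        else:
            review.append(row)
    return auto, review
```

The `needs_review` list is what a Streamlit review tool would load. Keeping thresholds per field lets harder fields send more rows to reviewers without slowing down the easy ones.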
Earning criteria — what you'll demonstrate
- Build a hybrid rule + ML extraction pipeline on real PDF data
- Calibrate per-field confidence to route to human review
- Evaluate information-extraction (IE) accuracy per field, not just overall
- Translate extraction performance into deployment recommendations
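The evaluation and calibration criteria above could be approached roughly as follows. This is a sketch under stated assumptions: exact string match stands in for whatever correctness rule each field really needs, the data layout (`lease_id -> field -> value`) is hypothetical, and `pick_threshold` is a simple threshold search on labelled data rather than a full calibration method such as temperature or isotonic scaling.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Fields = Dict[str, str]

def per_field_accuracy(predictions: Dict[str, Fields],
                       gold: Dict[str, Fields]) -> Dict[str, float]:
    """Exact-match accuracy for each field over every lease in the gold set.

    A missing prediction counts as an error.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for lease_id, gold_fields in gold.items():
        pred_fields = predictions.get(lease_id, {})
        for name, gold_value in gold_fields.items():
            total[name] += 1
            if pred_fields.get(name) == gold_value:
                correct[name] += 1
    return {name: correct[name] / total[name] for name in total}

def pick_threshold(scored: List[Tuple[float, bool]],
                   target: float = 0.95) -> Optional[float]:
    """Lowest confidence cutoff whose accepted rows reach `target` accuracy.

    `scored` holds (confidence, is_correct) pairs for ONE field on labelled data.
    Returns None when no cutoff reaches the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    hits = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        hits += ok
        # Only evaluate at the end of a run of tied confidences.
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        if hits / i >= target:
            best = conf
    return best
```

Running `pick_threshold` once per field on the gold set gives a per-field cutoff, and the share of rows falling below it is an estimate of the review workload, which is the kind of number a deployment recommendation can build on.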
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
NLP Engineer
Owning an IE pipeline on real PDFs with calibrated confidence is the day-one work of NLP engineers at any vertical-document AI startup.
This challenge sharpens
- information-extraction
- named-entity-recognition
- pdf-parsing
AI Engineer
Wiring the human-in-the-loop fallback plus the rollout plan is core AI-engineer work at vertical-AI vendors.
This challenge sharpens
- human-in-the-loop
- confidence-calibration
- pdf-parsing
Data Engineer
Designing the extraction pipeline and gold-set evaluation is the data-engineering backbone of any IE product.
This challenge sharpens
- information-extraction
- evaluation
- pdf-parsing