Curate a Domain Lexicon for a Climate-Tech NLP Stack
Overview
What this challenge is about.
You receive 5,000 policy documents and a benchmark of 200 documents with manually tagged domain terms. Curate a lexicon of about 1,500 terms, each with (1) a canonical English form, (2) Swahili, French, and Portuguese variants where they exist, (3) a one-line definition, and (4) a Wikidata QID where possible. Add the lexicon as a spaCy EntityRuler component layered on top of a baseline NER pipeline, then evaluate the precision and recall improvement over that baseline on the benchmark. Deliver a lexicon CSV, the integrated pipeline, an evaluation notebook, and a 4-page methodology note for the non-profit's grant report.
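One way to picture the lexicon-to-pipeline step is to turn each CSV row into EntityRuler-style pattern dictionaries. The sketch below uses only the standard library and makes several assumptions that are not part of the brief: the column headers, the `CLIMATE_TERM` label, and the helper name `lexicon_to_patterns` are all hypothetical, the two sample rows are illustrative, and splitting on whitespace only approximates spaCy's tokenizer (hyphenated terms would need more care). Swahili variants and QIDs are left blank rather than guessed.

```python
import csv
import io

# Hypothetical column headers: the brief names the fields, not the CSV schema.
VARIANT_COLUMNS = ["variant_sw", "variant_fr", "variant_pt"]

# Illustrative sample rows only; empty cells mean "no variant / QID recorded".
SAMPLE_CSV = """canonical_en,variant_sw,variant_fr,variant_pt,definition,wikidata_qid
carbon credit,,crédit carbone,crédito de carbono,Tradable certificate for one tonne of CO2-equivalent avoided or removed.,
climate adaptation,,adaptation au changement climatique,adaptação às alterações climáticas,Adjusting systems to actual or expected climate effects.,
"""


def lexicon_to_patterns(csv_text, label="CLIMATE_TERM"):
    """Build one case-insensitive token pattern per non-empty surface form."""
    patterns = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        surface_forms = [row["canonical_en"]] + [row[col] for col in VARIANT_COLUMNS]
        for form in surface_forms:
            form = form.strip()
            if not form:
                continue  # variant does not exist for this language
            patterns.append(
                {
                    "label": label,  # hypothetical entity label
                    # Whitespace split is an approximation of spaCy tokenization.
                    "pattern": [{"LOWER": tok} for tok in form.lower().split()],
                    # All variants share the canonical English form as their ID.
                    "id": row["canonical_en"],
                }
            )
    return patterns


patterns = lexicon_to_patterns(SAMPLE_CSV)
for p in patterns:
    print(p["id"], "<-", " ".join(t["LOWER"] for t in p["pattern"]))
```

With spaCy installed, a list like this can be handed to a ruler created with `nlp.add_pipe("entity_ruler", before="ner")` via `ruler.add_patterns(patterns)`. Whether the ruler runs before or after the statistical NER component, and whether `overwrite_ents` is enabled, is a design choice the brief leaves open and worth justifying in the methodology note.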
The Brief
What you'll do, and what you'll demonstrate.
Lift NER on African climate-policy documents by curating a domain lexicon and integrating it into a spaCy pipeline.
Earning criteria — what you'll demonstrate
- Curate a domain lexicon with multilingual variants
- Integrate a lexical resource into a spaCy pipeline
- Evaluate NER lift from a lexical-resource integration
- Author methodology notes suitable for grant reporting
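The "NER lift" criterion above comes down to comparing precision and recall with and without the lexicon on the same tagged benchmark. A minimal sketch follows, assuming exact-match scoring over `(doc_id, start, end)` spans; the function name `precision_recall` is hypothetical and the spans are made-up toy data, not benchmark results. Partial-match or per-label scoring would be equally valid choices.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over sets of (doc_id, start, end) spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy spans for illustration only (not real benchmark annotations).
gold = {("doc1", 0, 13), ("doc1", 40, 58), ("doc2", 5, 22), ("doc2", 30, 41)}
baseline = {("doc1", 0, 13), ("doc2", 7, 22)}  # one hit, one boundary error
with_lexicon = {("doc1", 0, 13), ("doc1", 40, 58), ("doc2", 5, 22), ("doc2", 60, 70)}

base_p, base_r = precision_recall(gold, baseline)
lex_p, lex_r = precision_recall(gold, with_lexicon)

print(f"baseline      P={base_p:.2f} R={base_r:.2f}")
print(f"with lexicon  P={lex_p:.2f} R={lex_r:.2f}")
print(f"lift          P={lex_p - base_p:+.2f} R={lex_r - base_r:+.2f}")
```

Reporting both numbers matters because a lexicon can raise recall while introducing false positives, as the spurious `("doc2", 60, 70)` span does here.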
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career paths this builds toward
Canonical roles
NLP Engineer
Lexicon curation and spaCy pipeline integration are core NLP-engineer work in any domain-specific NLP product.
This challenge sharpens
- lexical-resources
- named-entity-recognition
- spacy
Data Engineer
Curating and publishing a reusable lexical resource is the data-engineering skillset that open-data orgs hire for.
This challenge sharpens
- lexical-resources
- wikidata
- multilingual-nlp
Applied AI Scientist
Methodology-driven lift on a benchmark with honest limits is the day-to-day of applied AI scientists in research-grade NLP work.
This challenge sharpens
- named-entity-recognition
- evaluation
- multilingual-nlp