Overview
What this challenge is about.
You receive 12,000 policy PDFs and a benchmark of 200 documents with manually linked entities (places, organizations, policies). Build a pipeline that runs NER, generates candidates from Wikidata and EuroVoc, and disambiguates them by combining string similarity with KG-context similarity. Measure entity-level precision and recall on the benchmark. Then output the corpus as an RDF dataset that follows Linked Data principles (dereferenceable URIs, sameAs links to Wikidata, attribution metadata), and publish it as a small zipped Turtle file plus a README that the non-profit can host on its site.
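The disambiguation step described above can be sketched as a weighted sum of the two signals. This is a minimal illustration, not a prescribed method: the `Candidate` class, the placeholder entity IDs, the use of `difflib` for string similarity, Jaccard overlap for KG-context similarity, and the 0.5 weight are all assumptions made for the example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Candidate:
    entity_id: str      # placeholder ID for illustration, not a real Wikidata QID
    label: str          # preferred label of the candidate in the KG
    context: frozenset  # terms taken from the candidate's KG neighbourhood


def string_similarity(mention: str, label: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()


def context_similarity(doc_terms, kg_terms) -> float:
    """Jaccard overlap between document terms and KG-context terms."""
    doc_terms, kg_terms = set(doc_terms), set(kg_terms)
    union = doc_terms | kg_terms
    return len(doc_terms & kg_terms) / len(union) if union else 0.0


def rank(mention, doc_terms, candidates, weight=0.5):
    """Return (score, candidate) pairs, best first.

    `weight` is the share given to string similarity (assumed value);
    the remainder goes to context similarity.
    """
    scored = [
        (
            weight * string_similarity(mention, c.label)
            + (1 - weight) * context_similarity(doc_terms, c.context),
            c,
        )
        for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


candidates = [
    Candidate("EX_CITY", "Paris", frozenset({"france", "capital", "city"})),
    Candidate("EX_TREATY", "Paris Agreement", frozenset({"climate", "agreement", "treaty"})),
]
ranked = rank("Paris", {"climate", "agreement", "treaty"}, candidates)
for score, candidate in ranked:
    print(f"{candidate.entity_id}: {score:.2f}")
```

In this toy case the city wins on string similarity alone, but the surrounding document terms push the treaty to the top. The weight would normally be tuned on part of the 200-document benchmark rather than fixed by hand.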
The Brief
What you'll do, and what you'll demonstrate.
Link a 12,000-document climate-policy corpus to Wikidata and EuroVoc with measured precision and recall, and publish it as Linked Open Data.
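As a sketch of what "measured precision and recall" could look like at the entity level, the snippet below compares predicted links with gold links as sets. The tuple layout `(doc_id, start, end, uri)` and the exact-match criterion are assumptions for illustration; the benchmark may define a match differently (for example, allowing overlapping spans).

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores over sets of (doc_id, start, end, uri) tuples.

    A prediction counts as correct only if the span and the URI both
    match a gold annotation exactly (an assumed matching rule).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical annotations; the URIs are placeholders, not real identifiers.
gold = {
    ("doc1", 0, 5, "ex:A"),
    ("doc1", 20, 35, "ex:B"),
    ("doc2", 3, 9, "ex:C"),
    ("doc2", 40, 52, "ex:D"),
}
predicted = {
    ("doc1", 0, 5, "ex:A"),      # correct
    ("doc1", 20, 35, "ex:X"),    # right span, wrong entity
    ("doc2", 3, 9, "ex:C"),      # correct
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Reporting the matching rule alongside the numbers keeps the figures interpretable for anyone reading the methodology notes.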
Earning criteria — what you'll demonstrate
- Build an end-to-end entity-linking pipeline against Wikidata and EuroVoc
- Apply Linked Data publishing principles (URIs, sameAs, attribution)
- Evaluate entity-linking quality with entity-level precision and recall
- Author methodology notes appropriate to grant-reporting expectations
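The Linked Data publishing criterion (URIs, sameAs, attribution) can be illustrated with a small standard-library sketch that writes Turtle and zips it. The `ex:` base URI, the creator URI, the dictionary keys, and the choice of `dcterms:creator` for attribution are assumptions for the example; in practice an RDF library such as rdflib would be a more usual choice than hand-built strings.

```python
import io
import zipfile

PREFIXES = {
    "ex": "https://example.org/climate-corpus/entity/",  # hypothetical base URI
    "wd": "http://www.wikidata.org/entity/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dcterms": "http://purl.org/dc/terms/",
}


def turtle_literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes, quotes, newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_turtle(entities, creator_uri: str) -> str:
    """Serialize linked entities as Turtle.

    Each entity dict needs 'local_id' (assumed to be a valid Turtle local
    name), 'label', and 'qid' (the linked Wikidata item).
    """
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    for entity in entities:
        lines.append(f"ex:{entity['local_id']}")
        lines.append(f"    rdfs:label {turtle_literal(entity['label'])} ;")
        lines.append(f"    owl:sameAs wd:{entity['qid']} ;")
        lines.append(f"    dcterms:creator <{creator_uri}> .")
        lines.append("")
    return "\n".join(lines)


def zip_turtle(turtle_text: str, name: str = "corpus.ttl") -> bytes:
    """Return the bytes of a zip archive containing one Turtle file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, turtle_text)
    return buffer.getvalue()


entities = [{"local_id": "paris", "label": "Paris", "qid": "Q90"}]
turtle_text = to_turtle(entities, "https://example.org/about#team")  # hypothetical creator
archive_bytes = zip_turtle(turtle_text)
print(turtle_text)
```

For the `ex:` URIs to be dereferenceable, the host site would also need to serve something at those addresses; the README is a natural place to document how.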
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Data Engineer
Publishing a corpus as Linked Open Data with measured linking quality is the day-to-day of data engineers at research and open-data orgs.
This challenge sharpens
- linked-open-data
- rdf
- entity-linking
NLP Engineer
NER plus disambiguation against a real KG is core NLP-engineer work in any entity-linking product.
This challenge sharpens
- ner
- entity-linking
- wikidata
AI Solutions Architect
Specifying the linked-data architecture and the publishing pipeline is the AI solutions architect's role in open-data engagements.
This challenge sharpens
- linked-open-data
- rdf
- sparql