Overview
What this challenge is about.
You receive an inventory of the 200TB hot tier (around 1,200 tables, with around 38 PB of historical data referenced), the current Spark + Trino read patterns, and 6 months of schema-change tickets. Design the migration: a per-table strategy (in-place rewrite for small tables vs. write-once-rewrite-incrementally for the 30 largest), a catalog choice (Glue vs. Nessie vs. REST), and a cutover plan for live writes. Prototype on a 5TB subset (around 80 tables): migrate it, run validation queries via Spark and Trino, and measure the read-performance delta against raw Parquet. Deliver the migration scripts, a validation report, a 10-page runbook for the full 200TB, and a 4-page memo on ongoing operational costs.
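As a rough illustration of the per-table strategy split, the sketch below ranks an inventory by size and assigns the incremental strategy to the largest tables and an in-place rewrite to the rest. The `TableInfo` fields, the strategy labels, and the `assign_strategies` helper are hypothetical names chosen for this example; the real inventory format is not specified in the brief, and a real plan would likely also weigh read patterns and schema-change history.

```python
from dataclasses import dataclass

# Strategy labels follow the brief's wording; the constants are illustrative.
IN_PLACE = "in-place rewrite"
INCREMENTAL = "write-once-rewrite-incrementally"


@dataclass(frozen=True)
class TableInfo:
    name: str        # hypothetical inventory field
    size_bytes: int  # hypothetical inventory field


def assign_strategies(tables, largest_n=30):
    """Map each table name to a migration strategy.

    The `largest_n` biggest tables (30 in the brief) get the incremental
    strategy; every other table gets an in-place rewrite.
    """
    ranked = sorted(tables, key=lambda t: t.size_bytes, reverse=True)
    big = {t.name for t in ranked[:largest_n]}
    return {t.name: (INCREMENTAL if t.name in big else IN_PLACE) for t in tables}


if __name__ == "__main__":
    # Made-up tables, only to show the call shape.
    inventory = [
        TableInfo("events", 40 * 10**12),
        TableInfo("orders", 2 * 10**12),
        TableInfo("dim_country", 5 * 10**6),
    ]
    for name, strategy in assign_strategies(inventory, largest_n=1).items():
        print(f"{name}: {strategy}")
```

Keeping the cutoff as a parameter makes it easy to rerun the plan on the 5TB prototype subset, where a smaller `largest_n` is probably more appropriate.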
The Brief
What you'll do, and what you'll demonstrate.
Migrate a 200TB Parquet data lake to Iceberg with no read regression and a runbook the data-platform team can execute table-by-table.
Earning criteria — what you'll demonstrate
- Choose the right Iceberg catalog for a multi-engine read pattern
- Design per-table migration strategies for varying scale
- Validate that schema evolution and time travel work end to end
- Produce a runbook the data-platform team can execute table-by-table
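One way the "no read regression" goal could be checked per table is sketched below: compare a row count and a measured read time between the raw Parquet baseline and the migrated Iceberg table. The dictionary keys, the `validate_table` helper, and the 5% tolerance are assumptions made for this example, not values from the brief; the measurements themselves would come from the Spark and Trino validation queries.

```python
def read_perf_delta(parquet_seconds, iceberg_seconds):
    """Relative change in read time; a positive value means Iceberg is slower."""
    if parquet_seconds <= 0:
        raise ValueError("baseline read time must be positive")
    return (iceberg_seconds - parquet_seconds) / parquet_seconds


def validate_table(baseline, migrated, tolerance=0.05):
    """Return a list of failure messages for one table (empty means it passed).

    `baseline` and `migrated` are hypothetical dicts with "row_count" and
    "seconds" keys; `tolerance` is an assumed acceptable slowdown (5%).
    """
    failures = []
    if baseline["row_count"] != migrated["row_count"]:
        failures.append(
            f"row count mismatch: {baseline['row_count']} vs {migrated['row_count']}"
        )
    delta = read_perf_delta(baseline["seconds"], migrated["seconds"])
    if delta > tolerance:
        failures.append(f"read regression: {delta:.1%} slower than raw Parquet")
    return failures


if __name__ == "__main__":
    # Made-up measurements, only to show the call shape.
    parquet = {"row_count": 1_000_000, "seconds": 10.0}
    iceberg = {"row_count": 1_000_000, "seconds": 12.0}
    for message in validate_table(parquet, iceberg):
        print(message)
```

A check like this only covers counts and timing; schema evolution and time travel would still need their own queries against the chosen catalog.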
Program Fit
Where this fits in your program.
Sharpens the same skills your degree expects you to demonstrate.
Skills
Skills you'll demonstrate.
Each one shows up on your verified credential.
Careers
Roles this prepares you for.
Real titles. Real skill bridges. Pick the one closest to your trajectory.
Career mappings coming soon.