AI & Data
Generative AI & LLMs Challenges
Generative AI & LLMs challenges put you inside the work of building with large language models. You'll develop skills in prompt patterns, few-shot prompting, chain-of-thought, and LLM API integration, learning how these models behave before you scale them.
From there you'll handle the harder edges — RAG architectures, vector database basics, fine-tuning, and prompt versioning — putting LLM guardrails and LLM evaluation around every deployment the way AI teams actually do. Each challenge you solve earns a verified credential you can share with recruiters.
- Presentation | Intermediate | New
Design a Hybrid Symbolic-Neural Agent for an Enterprise RAG Demo
Design a hybrid agent for a 'company-policy assistant' demo: a symbolic planner decomposes user goals into typed subtasks ('find policy', 'check applicability', 'compose answer'…
- Hybrid AI
- Symbolic Planning
- RAG Architectures
Artificial Intelligence: Principles and Techniques
- Analysis | Intermediate | New
Catastrophic-Forgetting Audit on a Domain Fine-Tune
You receive the fine-tuned 7B chemistry model and its base, plus a benchmark basket (MMLU subset, GSM8K, IFEval, a small instruction-following set). Run all 4 benchmarks on both…
- Catastrophic Forgetting
- LLM Evaluation
- Fine Tuning
Fine-Tuning Large Language Models
- Research | Intermediate | New
QLoRA Fine-Tune for a Customer-Support Domain Assistant
You receive 8,000 anonymized support ticket pairs (question -> agent response), the company's product documentation (around 600 pages), and a strong RAG baseline already running…
- QLoRA
- Fine Tuning
- RAG Architectures
Fine-Tuning Large Language Models
- Code | Intermediate | New
LoRA Fine-Tune a 7B LLM for Legal-Clause Extraction
You receive a curated extraction dataset (2,000 train, 500 val, 500 test contracts with span-level labels across 12 clause types) and a fine-tunable 7B base model (e.g., Llama-3…
- Fine Tuning
- Parameter Efficient Tuning
Fine-Tuning Large Language Models
Practice your coursework on real scenarios.
Every challenge is shaped from real industry context — not generic exercises. The work mirrors what your degree prepares you for.
Why Ewance
- Research | Intermediate | New
Red-Team a Customer-Service Chatbot for Jailbreak Resistance
Use a published taxonomy of jailbreak categories (prompt injection, persona override, encoded payloads, multi-turn escalation, refusal bypass, tool-misuse). For each category, d…
- Red Team Operations
- Jailbreak Analysis
- LLM Evaluation
AI Safety and Alignment
- Research | Senior | New
DPO Preference-Tune a Code Assistant for Style Compliance
You receive a 7B coding base model, a client's published code-style guide (Python, around 80 pages), and a generated preference dataset (4,000 pairs of code snippets where one m…
- DPO
- Preference Optimization
- Fine Tuning
Fine-Tuning Large Language Models
- Research | Intermediate | New
Neuro-Symbolic Question Answering on an Enterprise Knowledge Graph
You receive a curated Turtle-format knowledge graph (around 2 million triples covering organizational structure, products, projects), 200 labeled question-SPARQL pairs split 140…
- Neuro-Symbolic
- SPARQL
- Knowledge Graphs
Fuzzy Logic, Knowledge Representation, and Symbolic Reasoning
- Code | Intermediate | New
Instruction-Tune a Small Model for an Edtech Tutor
You receive a 1.5B base model (e.g., SmolLM-1.7B or Qwen-1.8B), permission to use 2 hours of a rented A100, and a curated seed of around 5,000 math-tutoring dialogues. Augment w…
- Instruction Tuning
- Fine Tuning
- Dataset Curation
Fine-Tuning Large Language Models
Browse challenges
Explore role
Product Manager
Ship product that solves real user problems. Combine user research, prototyping, and stakeholder alignment to turn ambiguous briefs into measurable wins — the role at the centre of modern software teams.
- Research | Intermediate | New
Audit a Public LLM Benchmark for Validity Threats
Choose one open LLM benchmark (e.g., MMLU, GPQA, BIG-Bench-Hard, MATH). Read the benchmark paper plus at least three follow-up critiques. Audit (1) data contamination risk again…
- Benchmark Evaluation
- Data Contamination Analysis
- Annotation Methodology
AI Measurement and Evaluation
- Design | Intermediate | New
Design and Pitch an LLM-Powered Tutoring Product
As a 4-person team, deliver: (1) a product concept anchored in Jobs-to-be-Done (when X, I want Y so I can Z); (2) a Figma prototype of the full flow; (3) a partially functional …
- Product Design
- User Research
- LLM Evaluation
AI Software Engineering Group Project
- Design | Intermediate | New
Spec a Trust-and-Safety Eval Harness for an LLM-Powered Customer-Support Bot
You will spec a 6-page evaluation harness covering: (1) jailbreak test set (about 200 prompts across 6 attack families), (2) PII-leakage probes (about 100 synthetic-customer pro…
- LLM Evaluation
- Red Team Operations
- PII Detection
Trustworthy AI, Robustness, and Safety
- Code | Intermediate | New
Fine-Tune a 3B Open-Weight Model for Customer Support Triage
You receive 40,000 anonymized labelled support tickets across 18 categories. Fine-tune a 3B open-weight model using parameter-efficient fine-tuning (LoRA) for the classification…
- Fine Tuning
- Open-Weight LLMs
- Classification
Large Language Models
Build a verifiable portfolio.
Submissions become evidence. Reviewers with shipping experience score against a rubric; the result becomes a credential anyone can verify.
- Presentation | Beginner | New
Pitch an LLM Earnings-Call Analyst to an Equity Long-Short Team
Pick 3 publicly available US tech earnings-call transcripts (from a free source like sec.gov filings or company investor-relations pages) and build a retrieval-augmented LLM wor…
- Prompt Patterns
- RAG Architectures
- LLM Evaluation
AI and Quantitative Finance
- Research | Beginner | New
Run a Human-Preference Study Comparing Two Coding Assistants
Design a blinded paired-comparison study: 12 developer participants, each gets the same 8 realistic coding tasks (refactor, write a function, debug, test), each task is solved b…
- Experimental Design
- Statistical Evaluation
- Human Evaluation
AI Measurement and Evaluation
- Design | Intermediate | New
Design a Continuous Eval Pipeline for an Enterprise RAG Product
Design (and partially build) a continuous-eval pipeline for a RAG system: (1) a structured eval set with at least 50 queries grouped by query class; (2) automated scoring (LLM-a…
- Continuous Evaluation
- LLM Evaluation
- RAG Architectures
AI Measurement and Evaluation
- Code | Intermediate | New
Build an Evaluation Harness for an Internal LLM Assistant
You will design and implement an evaluation harness in Python that runs four test suites: (1) helpfulness (LLM-as-judge with rubric), (2) factual grounding (compare cited source…
- LLM Evaluation
- LLM-as-Judge
- Prompt Injection Testing
Large Language Models
- Design | Senior | New
Design an Eval Suite for a Multimodal Brainstorming Assistant
You receive (1) the assistant's current API, (2) a list of 6 launch user-personas, and (3) the product team's quality target ('beat the previous model on 4 of 6 personas'). Desi…
- LLM Evaluation
- Multimodal Evaluation
- Safety Evaluation
Generative AI
- Design | Intermediate | New
Instrument a Model Monitoring Stack from Scratch
Pick the priority product (recommended: the customer-service RAG assistant, around 40k queries/day). Define monitoring signals: input drift (Evidently/NannyML), output quality (LLM…
- Model Monitoring
- Data Drift Detection
- LLM Evaluation
ML Engineering and Production ML
- Code | Intermediate | New
Prototype Constitutional-AI Style Guardrails for an Internal Chatbot
Author a 'constitution' of 15 to 20 principles tailored to internal research use (no IP leakage, no off-label medical claims, no personnel-data fishing, etc.). Implement a criti…
- Constitutional AI
- Alignment Techniques
- LLM Evaluation
AI Safety and Alignment
- Research | Intermediate | New
Run an Alignment Probe on a Coding Assistant
You will design 240 probe prompts across 3 classes: (1) over-refusal (innocuous coding asks the model should fulfill), (2) insecure code patterns (asks where the model should wa…
- Red Team Operations
- Alignment Evaluation
- LLM Evaluation
Large Language Models
- Research | Intermediate | New
Design a Capability Evaluation for an Open-Weights Coding Model
Pick a recent open-weights coding model (e.g., a Qwen, DeepSeek, or Llama variant). Design an evaluation set of around 40 coding tasks across 4 buckets: standard benign coding, …
- Capability Evaluation
- Safety Evaluation
- LLM Evaluation
AI Safety and Alignment
- Analysis | Intermediate | New
Cut Latency and Cost on a High-Volume Summarization Service
You receive 30 days of anonymized request logs (prompt token counts, completion token counts, latencies, models used). Profile the cost and latency distribution, then design and…
- FinOps & Cost Optimization
- Latency Optimization
- Prompt Compression
LLM Application Development
- Code | Intermediate | New
Build a Domain Instruction-Tuning Recipe for a Legal Coach
You will source instruction data from three streams: ~3,000 synthetic paralegal Q&A generated by a frontier model (anonymized prompts), ~1,500 curated examples from public legal…
- Instruction Tuning
- Fine Tuning
- Data Curation
Large Language Models
How it works
From brief to credential, in six steps.
Step 01
Browse challenges aligned to your studies.
Step 02
Accept the one that fits your goals.
Step 03
Work through it with AI Copilot guidance.
Step 04
Submit for structured evaluation.
Step 05
Earn a verified credential.
Step 06
Add it to LinkedIn with one click.
Industry teams behind a decade of practitioner briefs
Hiring from this pool?
Sponsor a challenge and meet candidates through actual work.
Industry teams can shape briefs around the skills they hire for, then evaluate students on rubric-scored deliverables — not resumes.