AI Research
AI Safety Researcher
Think of this role as the loyal opposition inside an AI lab. While teammates race to make a model more capable, AI safety researchers ask what happens when it succeeds — at the wrong thing, for the wrong reasons, in the wrong hands.
The work spans red-teaming prompts, designing constitutional methods that nudge models toward principled behavior, and translating findings into guardrails that product teams can actually adopt. Good work here is rigorous and humble: it admits what's still unknown rather than papering over it.
Students grow into this path by pairing technical depth in PyTorch with reading widely across ethics, policy, and security. The field rewards people who can hold both at once.
- Code · Intermediate · New
Build an Evaluation Harness for an Internal LLM Assistant
You will design and implement an evaluation harness in Python that runs four test suites: (1) helpfulness (LLM-as-judge with rubric), (2) factual grounding (compare cited source…
- LLM Evaluation
- LLM as Judge
- Prompt Injection Testing
Large Language Models
- Code · Intermediate · New
Constitutional AI Critique Loop for Hallucination Reduction
You receive the meal-planning prompts (60 test cases with dietary constraints), an unrevised baseline (single-pass instruction-tuned model), and an empty nutrition-constraint co…
- Constitutional AI
- Self-Critique
- Alignment Prompting
Machine Learning from Human Preferences (RLHF and Alignment)
- Code · Intermediate · New
Prototype Constitutional-AI Style Guardrails for an Internal Chatbot
Author a 'constitution' of 15 to 20 principles tailored to internal research use (no IP leakage, no off-label medical claims, no personnel-data fishing, etc.). Implement a criti…
- Constitutional AI
- Alignment Techniques
- LLM Evaluation
AI Safety and Alignment
- Research · Intermediate · New
Run an Adversarial-Robustness Audit on a Face-Liveness Model for a Fintech
You receive a stand-in face-liveness model with the same backbone as the production model plus a labeled evaluation set of 2,000 frames. Apply three standard digital attacks (FG…
- Adversarial Robustness Research
- Face Liveness
- PyTorch or TensorFlow
Deep Learning for Computer Vision
Practice your coursework on real scenarios.
Every challenge is shaped from real-world context — not generic exercises. The work mirrors what your degree prepares you for.
Why Ewance
- Design · Intermediate · New
Score Compliance Risk for an Enterprise AI Rollout Pipeline
You will design a compliance-risk scoring methodology covering 8 attributes (data residency, model provider, retention policy, PII handling, audit trail, encryption, third-party…
- Risk Scoring
- Compliance Modeling
- Decision Support Systems
Decision Support Systems and Decision Analysis
- Code · Intermediate · New
De-Identify Patient Images for a Pharma Research Pipeline
You receive 500 internal benchmark images (already cleared for use), each labelled with bounding boxes around face/tattoo/jewelry regions. Build a pipeline that detects these re…
- Image De-Identification
- Object Detection
- Privacy-Preserving Vision
Image Processing and Computational Imaging
- Research · Intermediate · New
Safety-Test a Customer-Service Agent for Adversarial Prompts
You receive a sandboxed instance of the agent (a tool-using LLM that can read account balances and open support tickets — both mocked). Design a red-team suite of at least 80 pr…
- AI Agents
- Red Team Operations
- Adversarial Prompts
AI Agents and LLM-Based Agents
- Code · Intermediate · New
Prompt-Injection Hardening for a Customer-Support Agent
You receive the current agent prompt, the pen-tester's 60-attack injection test set (direct prompt injection, indirect via doc content, refusal-bypass, and exfiltration), and a …
- Prompt Injection Defense
- System Prompt Design
- Red Team Operations
Prompt Engineering
- Research · Senior · New
Investigate Why Our Generative Model Memorizes Training Data
Pick a small open-source diffusion model (e.g., a Stable-Diffusion-class community model trained on LAION-subset). Reproduce a published membership-inference + extraction probe …
- Generative Models
- Memorization Analysis
- Differential Privacy
Advanced Deep Learning
- Analysis · Beginner · New
Audit a Hiring-Screen Classifier for Fairness Across Cohorts
You receive the classifier as a black-box API and a synthetic-but-realistic dataset of 8,000 CVs with imputed demographic proxies (gender, age band, regional cluster) and labele…
- Fairness Evaluation
- Disparate Impact
- Audit Methodology
Trustworthy AI, Robustness, and Safety
- Research · Senior · New
Stress-Test Scalable Oversight on a Tool-Using Agent
Design a sandwich-oversight study: pick a task domain where non-expert oversight is plausible but not trivial (e.g., reviewing data-analysis steps, checking small bug fixes, eva…
- Scalable Oversight
- Alignment Research
- Experimental Design
AI Safety and Alignment
- Analysis · Intermediate · New
Audit a Sepsis Early-Warning Model for Subgroup Performance
You receive a pre-trained vendor model, the training-data summary, and a held-out hospital-network evaluation set (about 18,000 ICU stays with sepsis labels). Compute AUROC + AU…
- Model Evaluation
- Fairness Metrics
- Model Calibration
Machine Learning for Healthcare and Biomedicine
Build a verifiable portfolio.
Submissions become evidence. Reviewers with shipping experience score against a rubric; the result becomes a credential anyone can verify.
- Research · Beginner · New
Case-Study Analysis of a Public AI Incident
Pick one public AI incident (suggestions: a chatbot's harmful response that went viral, a facial-recognition false-arrest case, a financial-model bias scandal). Produce a 6-page…
- Incident Analysis
- Responsible AI
- Case Study Research
AI Ethics, Fairness, and Responsible AI
- Design · Intermediate · New
Spec Trust-and-Safety Eval Harness for an LLM-Powered Customer-Support Bot
You will spec a 6-page evaluation harness covering: (1) jailbreak test set (about 200 prompts across 6 attack families), (2) PII-leakage probes (about 100 synthetic-customer pro…
- LLM Evaluation
- Red Team Operations
- PII Detection
Trustworthy AI, Robustness, and Safety
- Analysis · Beginner · New
Stress-Test a Hiring-Funnel Model for Bias
You receive a synthetic-but-realistic dataset of 25,000 past applicants with features (years of experience, education tier, prior role tags) and outcome labels (advanced past th…
- Model Evaluation
- Fairness Metrics
- Logistic Regression
Machine Learning (Undergraduate)
- Analysis · Beginner · New
Audit Safety Stops for a Cafe-Service Robot Pilot
You receive 30 days of logs covering 240 near-miss events (close approach to a human, low-battery emergency, network loss). For each event, classify whether the safety stop trig…
- Safety Analysis
- Incident Review
- Failure Mode Analysis
Robotics
- Code · Intermediate · New
Train a Differentially Private Classifier on Medical Records
Use Opacus (PyTorch DP-SGD library). Train a tabular classifier (small MLP + gradient-boosted features) with DP-SGD at the agreed epsilon/delta. Run an accuracy-vs-privacy front…
- Differential Privacy
- DP-SGD
- Opacus
Privacy-Preserving Machine Learning
- Research · Intermediate · New
Red-Team an Image-Classification Pipeline for a Banking KYC Workflow
You receive the production image classifier as a black-box API plus a labeled validation set of 5,000 ID images. Run untargeted FGSM and PGD attacks (L_inf budget 4/255 and 8/25…
- Adversarial Attacks
- Robust Evaluation
- Red Team Operations
Trustworthy AI, Robustness, and Safety
- Research · Intermediate · New
Run an Alignment Probe on a Coding Assistant
You will design 240 probe prompts across 3 classes: (1) over-refusal (innocuous coding asks the model should fulfill), (2) insecure code patterns (asks where the model should wa…
- Red Team Operations
- Alignment Evaluation
- LLM Evaluation
Large Language Models
- Research · Intermediate · New
Audit an Agentic Workflow for Safety Failures
Read the system's existing capability spec + tool-allow-list. Design 50+ adversarial inputs across categories: prompt-injection, tool-confusion, scope-escape (agent does somethi…
- AI Red Teaming
- Agent Safety
- Prompt Injection
Multi-Agent Systems
- Code · Intermediate · New
RAG Faithfulness Evaluation for a Medical-Education Assistant
You receive 200 student-style questions, two RAG configurations (config A: vector-only + GPT-class generator; config B: hybrid + rerank + GPT-class generator), and the medical-t…
- RAG Evaluation
- Faithfulness
- LLM as Judge
Retrieval-Augmented Generation
- Research · Senior · New
Audit a Production Model for Membership Inference Attacks
Use a black-box membership inference attack (e.g., the LiRA or shadow-model attack). You have query access to a sandboxed copy of the model + the original training data labels f…
- Membership Inference
- Privacy Attacks
- Model Evaluation
Privacy-Preserving Machine Learning
- Research · Beginner · New
Plan a Field Study for an Autonomous Sidewalk Delivery Robot
You will design a mixed-methods field study spanning two weeks of observation on a fixed route, intercept surveys with ~80 pedestrians, and 8 short interviews with neighborhood …
- Field Study Design
- Human-Robot Interaction
- Research Ethics
Human-Robot Interaction
- Research · Intermediate · New
Red-Team Evaluation of a Refusal Policy
You receive the lab's written refusal policy (version 2.3) and a starter set of 60 red-team prompts (10 per category). Extend the set to 240 prompts (40 per category) using docu…
- Red Team Operations
- Refusal Policy
- Alignment Evaluation
Machine Learning from Human Preferences (RLHF and Alignment)
- Research · Intermediate · New
Audit Recommender Filter Bubbles for a Civic Forum
You receive 90 days of impression logs (about 30 million recommendation events) tagged with content viewpoint labels (left-leaning, center, right-leaning, non-political) from an…
- Recommender Evaluation
- Diversity Metrics
- Audit Methodology
Social Network Analysis and Web Science
- Analysis · Beginner · New
Audit a Hiring-Screening Model for Demographic Bias
You receive: (a) inference API access to the production model (black-box), (b) a 12,000-resume audit benchmark with self-declared gender and age-band labels (consented, GDPR-com…
- Fairness Metrics
- Bias Auditing
- Model Evaluation
AI Ethics, Fairness, and Responsible AI
- Research · Senior · New
Concept-Activation Vectors for an Autonomous-Vehicle Perception Audit
You receive a trained semantic-segmentation model (8 classes including pedestrian, vehicle, road, sky), an internal validation set of 2,500 driving frames, and a small concept-i…
- TCAV
- Concept Explanations
- Interpretability
Explainable and Interpretable AI
- Code · Intermediate · New
Generate Synthetic Tabular Data with Privacy Guarantees
Implement DP synthetic data generation: either DP-CTGAN, PATE-GAN, or a marginal-based DP method like PrivBayes / MWEM. Train on the real dataset (around 200,000 transactions, 1…
- Synthetic Data
- Differential Privacy
- Generative Models
Privacy-Preserving Machine Learning
- Research · Intermediate · New
Audit a Public LLM Benchmark for Validity Threats
Choose one open LLM benchmark (e.g., MMLU, GPQA, BIG-Bench-Hard, MATH). Read the benchmark paper plus at least three follow-up critiques. Audit (1) data contamination risk again…
- Benchmark Evaluation
- Data Contamination Analysis
- Annotation Methodology
AI Measurement and Evaluation
- Research · Intermediate · New
Design a Capability Evaluation for an Open-Weights Coding Model
Pick a recent open-weights coding model (e.g., a Qwen, DeepSeek, or Llama variant). Design an evaluation set of around 40 coding tasks across 4 buckets: standard benign coding, …
- Capability Evaluation
- Safety Evaluation
- LLM Evaluation
AI Safety and Alignment
- Research · Intermediate · New
Red-Team a Customer-Service Chatbot for Jailbreak Resistance
Use a published taxonomy of jailbreak categories (prompt injection, persona override, encoded payloads, multi-turn escalation, refusal bypass, tool-misuse). For each category, d…
- Red Team Operations
- Jailbreak Analysis
- LLM Evaluation
AI Safety and Alignment
- Analysis · Intermediate · New
Catastrophic-Forgetting Audit on a Domain Fine-Tune
You receive the fine-tuned 7B chemistry model and its base, plus a benchmark basket (MMLU subset, GSM8K, IFEval, a small instruction-following set). Run all 4 benchmarks on both…
- Catastrophic Forgetting
- LLM Evaluation
- Fine-Tuning
Fine-Tuning Large Language Models
- Code · Intermediate · New
Safety-Critical Test Harness for an AV Planner
Use CARLA (open-source AV simulator) and encode 10 representative safety scenarios across 3 categories (cut-in, pedestrian emergence, signalized-intersection right-of-way). Writ…
- Simulation
- Scenario Testing
- Safety Evaluation
AI for Autonomous Vehicles
- Research · Intermediate · New
Build Saliency-Map Explanations for Dermatology Triage
You receive a trained CNN (ResNet-50 backbone, 7-class lesion classifier) and a 1,000-image held-out test set with dermatologist labels. Implement Integrated Gradients, GradCAM,…
- Saliency Maps
- Integrated Gradients
- Grad-CAM
Explainable and Interpretable AI
- Analysis · Intermediate · New
Run a Pre-Deployment Fairness + Drift Audit on a Hiring Model
You receive a trained classifier (joblib), the training data sample, and a held-out 'next-month' evaluation set. Compute group fairness metrics (false-positive-rate gap, true-po…
- Fairness Metrics
- Drift Detection
- Bias Mitigation
Machine Learning in Practice
- Analysis · Intermediate · New
Chest-X-Ray Deployment Audit Across Hospital Sites
You receive (1) a vendor-supplied multi-label chest-X-ray classifier, (2) the current single-site held-out evaluation set, (3) a 12,000-image multi-site evaluation set with 14-f…
- Medical Imaging
- Classification
- Model Evaluation
Machine Learning for Imaging and Medical Image Analysis
How it works
From brief to credential, in six steps.
Step 01
Browse challenges aligned to your studies.
Step 02
Accept the one that fits your goals.
Step 03
Work through it with AI Copilot guidance.
Step 04
Submit for structured evaluation.
Step 05
Earn a verified credential.
Step 06
Add it to LinkedIn with one click.
Related roles you may want to explore
AI Research
Applied AI Scientist
Applied AI scientists live in the productive tension between research papers and product roadmaps. The work is reproducing a result from arXiv on a Tuesday, then deciding by Thursday whether it can be adapted to a problem nobody else has framed yet. Days mix ablation studies, careful evaluation design, and conversations with engineers about what's realistic to ship. Good work here looks like an experiment that cleanly disproves your favorite hypothesis, then suggests a better one. Students grow into this role by treating PyTorch and Hugging Face Transformers as their lab bench and learning to write up findings the way a scientist would: with assumptions, limitations, and a path for the next person to extend the work.
AI Research
ML Researcher
What if attention worked differently? What if a smaller model, trained better, could match a much larger one? ML researchers chase questions like these for a living. The role exists to push the frontier of what models can do — through careful ablation studies, novel architectures, and the patient grind of running experiments that often disprove your favorite hypothesis. Days mix reading recent papers, sketching ideas, and writing JAX or PyTorch code that someone else will read in six months. Students grow into this path through reproducing published results before inventing their own, and learning to write up findings with intellectual honesty. The best researchers stay curious about why something worked, not just that it did.
AI Research
Research Scientist
What does a model actually learn, and can we prove it? Research scientists in AI labs spend their careers refining that question. The work alternates between long stretches of reading, careful ablation studies in PyTorch, and the rare moment when a benchmark moves and you understand why. CUDA kernels and diffusion model architectures sit in the toolkit, but the real currency is taste: knowing which experiment is worth a week of compute and which is a distraction. Students who thrive here tend to come from machine learning, physics, or pure math, and they read papers the way novelists read novels. Expect a long apprenticeship reproducing others' results before your own ideas earn a place at a top venue.
Industry teams behind a decade of practitioner briefs
Hiring from this pool?
Sponsor a challenge and meet candidates through actual work.
Industry teams can shape briefs around the skills they hire for, then evaluate students on rubric-scored deliverables — not resumes.
Skills and disciplines shown on this page are derived from the Ewance challenge catalogue. When the median annual salary is available for this role via Adzuna, it will be shown above with the sample size and country.
Portrait: Photo by Angelo Abear on Unsplash.