AI Research
AI Safety Researcher
Think of this role as the loyal opposition inside an AI lab. While teammates race to make a model more capable, AI safety researchers ask what happens when it succeeds — at the wrong thing, for the wrong reasons, in the wrong hands.
The work spans red-teaming prompts, designing constitutional methods that nudge models toward principled behavior, and translating findings into guardrails that product teams can actually adopt. Good work here is rigorous and humble: it admits what's still unknown rather than papering over it.
Students grow into this path by pairing technical depth in PyTorch with reading widely across ethics, policy, and security. The field rewards people who can hold both at once.
- Research, Intermediate, New
Red-Team a Customer-Service Chatbot for Jailbreak Resistance
Use a published taxonomy of jailbreak categories (prompt injection, persona override, encoded payloads, multi-turn escalation, refusal bypass, tool-misuse). For each category, d…
- Red Team Operations
- Jailbreak Analysis
- LLM Evaluation
AI Safety and Alignment
- Design, Intermediate, New
Spec Trust-and-Safety Eval Harness for an LLM-Powered Customer-Support Bot
You will spec a 6-page evaluation harness covering: (1) jailbreak test set (about 200 prompts across 6 attack families), (2) PII-leakage probes (about 100 synthetic-customer pro…
- LLM Evaluation
- Red Team Operations
- PII Detection
Trustworthy AI, Robustness, and Safety
- Research, Intermediate, New
Run an Adversarial-Robustness Audit on a Face-Liveness Model for a Fintech
You receive a stand-in face-liveness model with the same backbone as the production model plus a labeled evaluation set of 2,000 frames. Apply three standard digital attacks (FG…
- Adversarial Robustness Research
- Face Liveness
- PyTorch or TensorFlow
Deep Learning for Computer Vision
- Research, Intermediate, New
Red-Team Evaluation of a Refusal Policy
You receive the lab's written refusal policy (version 2.3) and a starter set of 60 red-team prompts (10 per category). Extend the set to 240 prompts (40 per category) using docu…
- Red Team Operations
- Refusal Policy
- Alignment Evaluation
Machine Learning from Human Preferences (RLHF and Alignment)
Practice your coursework on real scenarios.
Every challenge is shaped from real-world context — not generic exercises. The work mirrors what your degree prepares you for.
Why Ewance
- Code, Intermediate, New
Build an Evaluation Harness for an Internal LLM Assistant
You will design and implement an evaluation harness in Python that runs four test suites: (1) helpfulness (LLM-as-judge with rubric), (2) factual grounding (compare cited source…
- LLM Evaluation
- LLM-as-Judge
- Prompt Injection Testing
Large Language Models
- Code, Intermediate, New
Generate Synthetic Tabular Data with Privacy Guarantees
Implement DP synthetic data generation: either DP-CTGAN, PATE-GAN, or a marginal-based DP method like PrivBayes / MWEM. Train on the real dataset (around 200,000 transactions, 1…
- Synthetic Data
- Differential Privacy
- Generative Models
Privacy-Preserving Machine Learning
- Research, Intermediate, New
Run an Alignment Probe on a Coding Assistant
You will design 240 probe prompts across 3 classes: (1) over-refusal (innocuous coding asks the model should fulfill), (2) insecure code patterns (asks where the model should wa…
- Red Team Operations
- Alignment Evaluation
- LLM Evaluation
Large Language Models
- Analysis, Intermediate, New
Chest-X-Ray Deployment Audit Across Hospital Sites
You receive (1) a vendor-supplied multi-label chest-X-ray classifier, (2) the current single-site held-out evaluation set, (3) a 12,000-image multi-site evaluation set with 14-f…
- Medical Imaging
- Classification
- Model Evaluation
Machine Learning for Imaging and Medical Image Analysis
Product Manager
Ship product that solves real user problems. Combine user research, prototyping, and stakeholder alignment to turn ambiguous briefs into measurable wins — the role at the centre of modern software teams.
- Research, Intermediate, New
Audit a Public LLM Benchmark for Validity Threats
Choose one open LLM benchmark (e.g., MMLU, GPQA, BIG-Bench-Hard, MATH). Read the benchmark paper plus at least three follow-up critiques. Audit (1) data contamination risk again…
- Benchmark Evaluation
- Data Contamination Analysis
- Annotation Methodology
AI Measurement and Evaluation
- Code, Intermediate, New
Train a Differentially Private Classifier on Medical Records
Use Opacus (PyTorch DP-SGD library). Train a tabular classifier (small MLP + gradient-boosted features) with DP-SGD at the agreed epsilon/delta. Run an accuracy-vs-privacy front…
- Differential Privacy
- DP-SGD
- Opacus
Privacy-Preserving Machine Learning
- Research, Intermediate, New
Design a Capability Evaluation for an Open-Weights Coding Model
Pick a recent open-weights coding model (e.g., a Qwen, DeepSeek, or Llama variant). Design an evaluation set of around 40 coding tasks across 4 buckets: standard benign coding, …
- Capability Evaluation
- Safety Evaluation
- LLM Evaluation
AI Safety and Alignment
- Research, Intermediate, New
Build Saliency-Map Explanations for Dermatology Triage
You receive a trained CNN (ResNet-50 backbone, 7-class lesion classifier) and a 1,000-image held-out test set with dermatologist labels. Implement Integrated Gradients, GradCAM,…
- Saliency Maps
- Integrated Gradients
- Grad-CAM
Explainable and Interpretable AI
Build a verifiable portfolio.
Submissions become evidence. Reviewers with shipping experience score against a rubric; the result becomes a credential anyone can verify.
- Code, Intermediate, New
Safety-Critical Test Harness for an AV Planner
Use CARLA (open-source AV simulator) and encode 10 representative safety scenarios across 3 categories (cut-in, pedestrian emergence, signalized-intersection right-of-way). Writ…
- Simulation
- Scenario Testing
- Safety Evaluation
AI for Autonomous Vehicles
- Code, Intermediate, New
Constitutional AI Critique Loop for Hallucination Reduction
You receive the meal-planning prompts (60 test cases with dietary constraints), an unrevised baseline (single-pass instruction-tuned model), and an empty nutrition-constraint co…
- Constitutional AI
- Self-Critique
- Alignment Prompting
Machine Learning from Human Preferences (RLHF and Alignment)
- Analysis, Intermediate, New
Catastrophic-Forgetting Audit on a Domain Fine-Tune
You receive the fine-tuned 7B chemistry model and its base, plus a benchmark basket (MMLU subset, GSM8K, IFEval, a small instruction-following set). Run all 4 benchmarks on both…
- Catastrophic Forgetting
- LLM Evaluation
- Fine-Tuning
Fine-Tuning Large Language Models
- Research, Intermediate, New
Red-Team an Image-Classification Pipeline for a Banking KYC Workflow
You receive the production image classifier as a black-box API plus a labeled validation set of 5,000 ID images. Run untargeted FGSM and PGD attacks (L_inf budget 4/255 and 8/25…
- Adversarial Attacks
- Robust Evaluation
- Red Team Operations
Trustworthy AI, Robustness, and Safety
- Design, Intermediate, New
Score Compliance Risk for an Enterprise AI Rollout Pipeline
You will design a compliance-risk scoring methodology covering 8 attributes (data residency, model provider, retention policy, PII handling, audit trail, encryption, third-party…
- Risk Scoring
- Compliance Modeling
- Decision Support Systems
Decision Support Systems and Decision Analysis
- Code, Intermediate, New
Prompt-Injection Hardening for a Customer-Support Agent
You receive the current agent prompt, the pen-tester's 60-attack injection test set (direct prompt injection, indirect via doc content, refusal-bypass, and exfiltration), and a …
- Prompt Injection Defense
- System Prompt Design
- Red Team Operations
Prompt Engineering
- Analysis, Intermediate, New
Run a Pre-Deployment Fairness + Drift Audit on a Hiring Model
You receive a trained classifier (joblib), the training data sample, and a held-out 'next-month' evaluation set. Compute group fairness metrics (false-positive-rate gap, true-po…
- Fairness Metrics
- Drift Detection
- Bias Mitigation
Machine Learning in Practice
- Research, Intermediate, New
Audit an Agentic Workflow for Safety Failures
Read the system's existing capability spec + tool-allow-list. Design 50+ adversarial inputs across categories: prompt-injection, tool-confusion, scope-escape (agent does somethi…
- AI Red Teaming
- Agent Safety
- Prompt Injection
Multi-Agent Systems
- Research, Intermediate, New
Audit Recommender Filter Bubbles for a Civic Forum
You receive 90 days of impression logs (about 30 million recommendation events) tagged with content viewpoint labels (left-leaning, center, right-leaning, non-political) from an…
- Recommender Evaluation
- Diversity Metrics
- Audit Methodology
Social Network Analysis and Web Science
- Analysis, Intermediate, New
Audit a Sepsis Early-Warning Model for Subgroup Performance
You receive a pre-trained vendor model, the training-data summary, and a held-out hospital-network evaluation set (about 18,000 ICU stays with sepsis labels). Compute AUROC + AU…
- Model Evaluation
- Fairness Metrics
- Model Calibration
Machine Learning for Healthcare and Biomedicine
- Research, Intermediate, New
Safety-Test a Customer-Service Agent for Adversarial Prompts
You receive a sandboxed instance of the agent (a tool-using LLM that can read account balances and open support tickets — both mocked). Design a red-team suite of at least 80 pr…
- AI Agents
- Red Team Operations
- Adversarial Prompts
AI Agents and LLM-Based Agents
- Code, Intermediate, New
Prototype Constitutional-AI Style Guardrails for an Internal Chatbot
Author a 'constitution' of 15 to 20 principles tailored to internal research use (no IP leakage, no off-label medical claims, no personnel-data fishing, etc.). Implement a criti…
- Constitutional AI
- Alignment Techniques
- LLM Evaluation
AI Safety and Alignment
- Code, Intermediate, New
RAG Faithfulness Evaluation for a Medical-Education Assistant
You receive 200 student-style questions, two RAG configurations (config A: vector-only + GPT-class generator; config B: hybrid + rerank + GPT-class generator), and the medical-t…
- RAG Evaluation
- Faithfulness
- LLM-as-Judge
Retrieval-Augmented Generation
- Code, Intermediate, New
De-Identify Patient Images for a Pharma Research Pipeline
You receive 500 internal benchmark images (already cleared for use), each labelled with bounding boxes around face/tattoo/jewelry regions. Build a pipeline that detects these re…
- Image De-Identification
- Object Detection
- Privacy-Preserving Vision
Image Processing and Computational Imaging
How it works
From brief to credential, in six steps.
Step 01
Browse challenges aligned to your studies.
Step 02
Accept the one that fits your goals.
Step 03
Work through it with AI Copilot guidance.
Step 04
Submit for structured evaluation.
Step 05
Earn a verified credential.
Step 06
Add it to LinkedIn with one click.
Related roles you may want to explore
AI Research
Applied AI Scientist
Applied AI scientists live in the productive tension between research papers and product roadmaps. The work is reproducing a result from arXiv on a Tuesday, then deciding by Thursday whether it can be adapted to a problem nobody else has framed yet. Days mix ablation studies, careful evaluation design, and conversations with engineers about what's realistic to ship. Good work here looks like an experiment that disproves your favorite hypothesis cleanly, then suggests a better one. Students grow into this role by treating PyTorch and Hugging Face Transformers as their lab bench and learning to write up findings the way a scientist would: with assumptions, limitations, and a path for the next person to extend the work.
AI Research
ML Researcher
What if attention worked differently? What if a smaller model, trained better, could match a much larger one? ML researchers chase questions like these for a living. The role exists to push the frontier of what models can do — through careful ablation studies, novel architectures, and the patient grind of running experiments that often disprove your favorite hypothesis. Days mix reading recent papers, sketching ideas, and writing JAX or PyTorch code that someone else will read in six months. Students grow into this path through reproducing published results before inventing their own, and learning to write up findings with intellectual honesty. The best researchers stay curious about why something worked, not just that it did.
AI Research
Research Scientist
What does a model actually learn, and can we prove it? Research scientists in AI labs spend their careers refining that question. The work alternates between long stretches of reading, careful ablation studies in PyTorch, and the rare moment when a benchmark moves and you understand why. CUDA kernels and diffusion model architectures sit in the toolkit, but the real currency is taste: knowing which experiment is worth a week of compute and which is a distraction. Students who thrive here tend to come from machine learning, physics, or pure math, and they read papers the way novelists read novels. Expect a long apprenticeship reproducing others' results before your own ideas earn a place at a top venue.
Industry teams behind a decade of practitioner briefs
Hiring from this pool?
Sponsor a challenge and meet candidates through actual work.
Industry teams can shape briefs around the skills they hire for, then evaluate students on rubric-scored deliverables — not resumes.
Skills and disciplines shown on this page are derived from the Ewance challenge catalogue. When the median annual salary is available for this role via Adzuna, it will be shown above with the sample size and country.
Portrait: Photo by Angelo Abear on Unsplash.