Building an 11,000-case OET role-play bank
How we generated and audited 11,000+ OET role-play cases across 12 professions — the design constraints, what failed, and what the audit pipeline taught us.
A bank of practice cases is only useful if every case is:
- Medically plausible for the profession (a nursing case can't have a dosage error; a vet case can't ask the candidate to prescribe a human medication)
- Pedagogically aligned with the OET rubric (each task should be assessable against at least one of the 9 criteria)
- Diverse enough that candidates aren't drilling the same scenario twice
- Tagged with the criteria the case strengthens so the recommender can target weak spots
Building one with humans is slow and expensive. Building one with AI is fast but produces fluent-sounding nonsense if you don't audit carefully. This post walks through what we did, what failed, and what the audit pipeline looks like in production.
v1 — generate first, audit later
The first version of the case bank was 6,082 cases generated against per-profession topic schemas. Each case had:
- Profession + topic + scenario type
- A candidate role-card (role, setting, tasks)
- A patient/roleplayer card (background, planned responses, key disclosures)
- An expected criteria-targeting profile
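To make the structure concrete, here is a rough sketch of what one of these case records might look like, assuming a Python representation (the field names are illustrative, not our production schema):

```python
from dataclasses import dataclass, field

@dataclass
class RolePlayCase:
    # Scenario identity
    profession: str             # e.g. "Nursing"
    topic: str                  # e.g. "Cardiology"
    scenario_type: str          # e.g. "discharge planning"

    # Candidate role-card: role, setting, and the task list
    candidate_role: str
    setting: str
    tasks: list[str] = field(default_factory=list)

    # Patient / roleplayer card
    patient_background: str = ""
    planned_responses: list[str] = field(default_factory=list)
    key_disclosures: list[str] = field(default_factory=list)

    # Which of the 9 speaking criteria the case is expected to exercise
    criteria_targets: list[str] = field(default_factory=list)
```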
We shipped this version internally and immediately found problems. Across the bank, ~8.9% of cases had content-quality flags:
- Clinical implausibility (a paediatric oncology case asking the nurse to prescribe a dose)
- Tasks that overlapped with each other (two tasks asking effectively the same thing)
- Role-cards that the candidate could "solve" in 90 seconds, leaving 3.5 minutes of dead air
- Profession-scenario mismatches (a veterinary case where the patient was a human)
The honest version: 8.9% is far too high to ship to candidates who are spending scarce study time on it. A single broken case sucks the value out of an entire study session.
What went wrong
A few specific failure modes traced back to generation prompt structure:
- The "diverse enough" criterion fought "medically plausible". When we asked for novelty, we got plausibility drift.
- The task list was generated independently of the role-card. Tasks sometimes asked the candidate to do things the role-card didn't set up.
- The profession topic schema was too coarse. "Cardiology" as a topic gave the model permission to generate cases that drifted out of typical nursing cardiology into doctor cardiology.
v2 — refactor with explicit constraints
We rewrote generation as a two-pass pipeline:
Pass 1: Profession-aware scenario plan
A short structured generation produces:
- Profession + sub-specialty + setting (e.g. "Nursing, community / outpatient")
- Patient demographic + presenting concern + 2-sentence clinical context
- Expected 9-criteria targeting profile
This pass is heavily constrained. Profession-specific topic schemas restrict the sub-specialty/topic combinations to ones we've manually whitelisted as in-scope.
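As an illustration of what "manually whitelisted" means in practice, a Pass 1 plan is only accepted if its profession / sub-specialty / setting combination appears in a hand-maintained schema. A hypothetical sketch (the schema contents below are examples, not our real whitelist):

```python
# Hand-maintained, per-profession whitelist of in-scope sub-specialty/setting combinations.
# The entries here are illustrative examples only.
TOPIC_SCHEMAS = {
    "Nursing": {
        ("cardiology", "community / outpatient"),
        ("cardiology", "ward / inpatient"),
        ("diabetes management", "community / outpatient"),
    },
    "Veterinary Science": {
        ("small animal dermatology", "clinic"),
        ("post-operative care", "clinic"),
    },
}

def plan_in_scope(profession: str, sub_specialty: str, setting: str) -> bool:
    """Reject any Pass 1 plan whose combination isn't explicitly whitelisted."""
    return (sub_specialty, setting) in TOPIC_SCHEMAS.get(profession, set())
```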
Pass 2: Generate the cards against the plan
Given the plan, the second pass generates:
- The candidate role-card (5-task list + setting)
- The patient/roleplayer card with planned responses
Crucially, this pass cannot introduce a sub-specialty or topic not in the plan. Generation drift on plausibility dropped significantly.
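How that constraint is enforced is an implementation detail: it can live in the prompt, in a structured-output schema, or in a post-generation check. A hypothetical sketch of the last of those, diffing the generated cards against the Pass 1 plan before accepting them (field names are assumptions):

```python
def cards_respect_plan(plan: dict, cards: dict) -> list[str]:
    """Return a list of drift problems; an empty list means the cards stay inside the plan.

    `plan` and `cards` are assumed to be the structured outputs of Pass 1 and
    Pass 2 respectively.
    """
    problems = []
    if cards.get("sub_specialty") != plan.get("sub_specialty"):
        problems.append("Pass 2 changed the sub-specialty fixed by the plan")
    if cards.get("setting") != plan.get("setting"):
        problems.append("Pass 2 changed the setting fixed by the plan")
    # Any topic mentioned by the cards must be a subset of the planned topics.
    extra_topics = set(cards.get("topics", [])) - set(plan.get("topics", []))
    if extra_topics:
        problems.append(f"Pass 2 introduced unplanned topics: {sorted(extra_topics)}")
    return problems
```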
v2.1 — the audit pipeline
For each generated case, we run a separate auditor model against a 9-point rubric:
| Audit check | What it verifies |
|---|---|
| Profession plausibility | Is this scenario plausible for this profession? |
| Clinical safety | Are the dosages, medications, and red flags reasonable? |
| Task independence | Do the 5 tasks each ask for different things? |
| Task feasibility in 5 minutes | Can the candidate plausibly cover all tasks in the time allowed? |
| Criteria-targeting | Does each task map to ≥1 of the 9 criteria? |
| Role-card scope | Is the role-card scope manageable in 3 minutes of prep? |
| Patient script consistency | Does the patient script align with the role-card disclosures? |
| Cultural neutrality | Avoids region-specific assumptions that would unfairly disadvantage candidates from elsewhere |
| Content uniqueness | Sufficiently different from cases already in the bank |
Cases that fail any one check are flagged for human review. Cases that pass go into the active bank.
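In pipeline terms the gate is simple: every check is pass/fail, and a single failure routes the case to human review instead of the active bank. A minimal sketch, assuming each check is a function over the case record (check names mirror the table above; the implementations here are placeholders):

```python
from typing import Callable

# Each check takes a case dict and returns (passed, note). The implementations
# are the interesting part; they're stubbed out here.
AUDIT_CHECKS: dict[str, Callable[[dict], tuple[bool, str]]] = {
    "profession_plausibility": lambda case: (True, ""),
    "clinical_safety": lambda case: (True, ""),
    # ... the remaining seven checks from the rubric table ...
}

def audit_case(case: dict) -> dict:
    """Run every rubric check; any single failure flags the case for human review."""
    failures = {}
    for name, check in AUDIT_CHECKS.items():
        passed, note = check(case)
        if not passed:
            failures[name] = note
    return {
        "case_id": case.get("id"),
        "status": "needs_human_review" if failures else "active",
        "failures": failures,
    }
```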
After running the v2.1 audit over the v2-generated cases, the content-quality flag rate dropped from 8.9% to ~1.4%.
v2.2 — enrichment pass
The v2.1 cases were medically correct but sometimes felt dry on the candidate role-card side. v2.2 introduced an enrichment pass that adds:
- A 1-paragraph scenario context that's actually engaging
- A patient opening line in their voice
- Slightly richer task descriptions
This pass is purely additive — it never changes the medical content. The audit pipeline re-checks plausibility after enrichment.
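One way to keep an enrichment pass honestly additive is to let it write only to a fixed set of new fields and reject any output that touches anything else, then send the merged case back through the audit. A hypothetical sketch (field names are illustrative):

```python
# Fields the enrichment pass is allowed to add; everything else is off-limits.
ENRICHMENT_FIELDS = {"scenario_context", "patient_opening_line", "enriched_tasks"}

def apply_enrichment(case: dict, enrichment: dict) -> dict:
    """Merge enrichment output into a case without letting it alter the medical content."""
    unexpected = set(enrichment) - ENRICHMENT_FIELDS
    if unexpected:
        raise ValueError(f"Enrichment tried to write non-additive fields: {sorted(unexpected)}")
    # The whitelist is disjoint from the medical fields, so nothing clinical is overwritten.
    return {**case, **enrichment}
```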
After v2.2, the bank stands at 11,000+ cases, distributed across the 12 OET professions roughly in proportion to the candidate volume each profession sees on the exam.
What we'd do differently
Three things, in priority order:
- Start with the audit, not the generation. A working auditor is more valuable than a fancier generator. Get the audit checks rock solid; then iterate generation against them.
- Profession schemas need expert input. We initially derived schemas from public OET sample material. For some professions (veterinary, podiatry) the public material is thin and we ended up under-covering common scenarios. Better: collaborate with a working clinician in each profession to validate the schema.
- Audit re-runs should be cheap. Every prompt change to the auditor means re-auditing the whole bank. Make the re-audit pipeline a one-command operation. Ours wasn't initially and we paid for it in elapsed time.
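"One command" can be as boring as a thin CLI wrapper that streams every case through the current auditor and writes a fresh flag report. A hypothetical sketch, reusing the `audit_case` function sketched earlier:

```python
import argparse
import json

from audit import audit_case  # hypothetical module wrapping the rubric checks sketched above

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Re-audit the whole case bank against the current auditor prompt.")
    parser.add_argument("bank_path", help="JSON Lines file containing the full case bank")
    parser.add_argument("--report", default="audit_report.jsonl",
                        help="Where to write the per-case audit results")
    args = parser.parse_args()

    with open(args.bank_path) as bank, open(args.report, "w") as report:
        for line in bank:
            result = audit_case(json.loads(line))
            report.write(json.dumps(result) + "\n")

if __name__ == "__main__":
    main()
```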
What the bank looks like in production
When a candidate hits "Recommended" in OET Live, a weighted recommender picks a case from the bank based on:
- Your two weakest criteria (50% weight)
- Topics you haven't tried in the last 30 days (30% weight)
- Topics where your previous attempts scored below your overall average (20% weight)
The result is that the case you get is engineered to move your weakest band — not random.
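A minimal sketch of how a weighting like that could be combined into a single score per case (the signal functions and field names here are assumptions, not the production recommender):

```python
WEIGHTS = {"weak_criteria": 0.5, "untried_topic": 0.3, "below_average_topic": 0.2}

def score_case(case: dict, candidate: dict) -> float:
    """Combine the three recommendation signals into one score in [0, 1]."""
    # 1) Does this case target either of the candidate's two weakest criteria?
    weak_hit = bool(set(case["criteria_targets"]) & set(candidate["two_weakest_criteria"]))

    # 2) Has the candidate left this topic untried in the last 30 days?
    untried = case["topic"] not in candidate["topics_last_30_days"]

    # 3) Did previous attempts on this topic score below the candidate's overall average?
    prior = candidate["topic_averages"].get(case["topic"])
    below_avg = prior is not None and prior < candidate["overall_average"]

    return (WEIGHTS["weak_criteria"] * weak_hit
            + WEIGHTS["untried_topic"] * untried
            + WEIGHTS["below_average_topic"] * below_avg)

# The recommender would then pick the highest-scoring case, or sample
# proportionally to score to keep some variety between sessions.
```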
Future passes
The next case-bank work:
- Per-country cultural register — some scenarios should land differently depending on whether the candidate is registering with UK NMC vs AHPRA. Right now we average across regions.
- Recency cycling — we want to retire cases that get over-practiced and refresh the bank quarterly.
- Calibration sets — a subset of cases scored by examiner-equivalent humans to validate the auto-scoring continues to track real OET scoring.
If you want to practice on the current bank, join the waitlist.
Further reading
- How AI patients change OET practice — sibling methodology post on the AI patient side
- How OET Speaking is actually scored — the rubric the audit pipeline tracks