
Building an 11,000-case OET role-play bank

How we generated and audited 11,000+ OET role-play cases across 12 professions — the design constraints, what failed, and what the v2.2 audit pass taught us.

5 min read · By OET Live

A bank of practice cases is only useful if every case is:

  1. Medically plausible for the profession (a nursing case can't ask the candidate to prescribe a dose; a veterinary case can't ask the candidate to prescribe a human medication)
  2. Pedagogically aligned with the OET rubric (each task should be assessable against at least one of the 9 criteria)
  3. Diverse enough that candidates aren't drilling the same scenario twice
  4. Tagged with the criteria the case strengthens so the recommender can target weak spots

Building one with humans is slow and expensive. Building one with AI is fast but produces fluent-sounding nonsense if you don't audit carefully. This post walks through what we did, what failed, and what the audit pipeline looks like in production.

v1 — generate first, audit later

The first version of the case bank was 6,082 cases generated against per-profession topic schemas. Each case had:

  • Profession + topic + scenario type
  • A candidate role-card (role, setting, tasks)
  • A patient/roleplayer card (background, planned responses, key disclosures)
  • An expected criteria-targeting profile
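The four components above can be sketched as a single record. This is a hypothetical schema — the field names are our illustration, not the production data model:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a v1 case record; field names are guesses,
# not the production schema.
@dataclass
class RolePlayCase:
    profession: str             # e.g. "Nursing"
    topic: str                  # e.g. "cardiology"
    scenario_type: str          # e.g. "outpatient consultation"
    candidate_card: dict        # role, setting, tasks
    roleplayer_card: dict       # background, planned responses, disclosures
    criteria_profile: list = field(default_factory=list)  # targeted rubric criteria

case = RolePlayCase(
    profession="Nursing",
    topic="cardiology",
    scenario_type="outpatient consultation",
    candidate_card={"role": "community nurse", "tasks": []},
    roleplayer_card={"background": "65-year-old with hypertension"},
    criteria_profile=["Relationship building", "Information gathering"],
)
```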

We shipped this version internally and immediately found problems. Across the bank, ~8.9% of cases had content-quality flags:

  • Clinical implausibility (a paediatric oncology case asking the nurse to prescribe a dose)
  • Tasks that overlapped with each other (two tasks asking effectively the same thing)
  • Role-cards that the candidate could "solve" in 90 seconds, leaving 3.5 minutes of dead air
  • Profession-scenario mismatches (a veterinary case where the patient was a human)

The honest version: 8.9% is too high to ship to candidates who are investing real study time in it. A case that's broken sucks the value out of an entire study session.

What went wrong

A few specific failure modes traced back to generation prompt structure:

  1. The "diverse enough" criterion fought "medically plausible". When we asked for novelty, we got plausibility drift.
  2. The task list was generated independently of the role-card. Tasks sometimes asked the candidate to do things the role-card didn't set up.
  3. The profession topic schema was too coarse. "Cardiology" as a topic gave the model permission to generate cases that drifted out of typical nursing cardiology into doctor cardiology.

v2 — refactor with explicit constraints

We rewrote generation as a two-pass pipeline:

Pass 1: Profession-aware scenario plan

A short structured generation produces:

  • Profession + sub-specialty + setting (e.g. "Nursing, community / outpatient")
  • Patient demographic + presenting concern + 2-sentence clinical context
  • Expected 9-criteria targeting profile

This pass is constrained heavily. Profession-specific topic schemas restrict the sub-specialty/topic combinations to ones we've manually whitelisted as in-scope.
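The whitelist constraint reduces to a set-membership check. A minimal sketch, assuming the schema is a set of (profession, sub-specialty) pairs — the pairs below are illustrative, not the real curated schema:

```python
# Hypothetical whitelist of in-scope (profession, sub_specialty) pairs.
# The real schemas are manually curated per profession.
ALLOWED = {
    ("Nursing", "community cardiology"),
    ("Nursing", "post-operative care"),
    ("Veterinary Science", "small-animal dermatology"),
}

def plan_in_scope(profession: str, sub_specialty: str) -> bool:
    """Reject any Pass-1 plan whose topic combination isn't whitelisted."""
    return (profession, sub_specialty) in ALLOWED
```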

Pass 2: Generate the cards against the plan

Given the plan, the second pass generates:

  • The candidate role-card (5-task list + setting)
  • The patient/roleplayer card with planned responses

Crucially, this pass cannot introduce a sub-specialty or topic not in the plan. Generation drift on plausibility dropped significantly.
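One cheap way to enforce that constraint is a field-level drift check between the plan and the generated cards. A sketch under assumed field names (the real check also needs the auditor, since drift can be semantic rather than literal):

```python
def cards_match_plan(plan: dict, cards: dict) -> list:
    """Return drift errors: plan fields the cards changed.
    Field names are illustrative, not the production schema."""
    errors = []
    for key in ("profession", "sub_specialty", "setting"):
        if cards.get(key) != plan.get(key):
            errors.append(f"{key}: plan={plan.get(key)!r} cards={cards.get(key)!r}")
    return errors
```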

v2.1 — the audit pipeline

For each generated case, we run a separate auditor model against a 9-point rubric:

| Audit check | What it verifies |
|---|---|
| Profession plausibility | Is this scenario plausible for this profession? |
| Clinical safety | Are the dosages, medications, and red flags reasonable? |
| Task independence | Do the 5 tasks each ask for different things? |
| Task feasibility in 5 minutes | Can the candidate plausibly cover all tasks in the time allowed? |
| Criteria-targeting | Does each task map to ≥1 of the 9 criteria? |
| Role-card scope | Is the role-card scope manageable in 3 minutes of prep? |
| Patient script consistency | Does the patient script align with the role-card disclosures? |
| Cultural neutrality | Avoids region-specific assumptions that would unfairly disadvantage candidates from elsewhere |
| Content uniqueness | Sufficiently different from cases already in the bank |

Cases that fail any one check are flagged for human review. Cases that pass go into the active bank.
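The fail-on-any-check gate can be sketched as below. In production each check is a call to the auditor model; the lambda stand-ins here are purely illustrative:

```python
def run_audit(case: dict, checks: dict) -> dict:
    """Run every audit check; a case passes only if all checks pass.
    `checks` maps check name -> callable(case) -> bool."""
    failed = [name for name, check in checks.items() if not check(case)]
    return {"passed": not failed, "failed_checks": failed}

# Toy stand-ins for two of the nine checks; the real ones invoke an
# auditor model against the rubric.
checks = {
    "task_independence": lambda c: len(set(c["tasks"])) == len(c["tasks"]),
    "task_count": lambda c: len(c["tasks"]) == 5,
}
```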

After running the v2.1 audit on the v2 pass, the content-quality flag rate dropped from 8.9% to ~1.4%.

v2.2 — enrichment pass

The v2.1 cases were medically correct but sometimes felt dry on the candidate role-card side, so v2.2 introduced an enrichment pass that adds:

  • A 1-paragraph scenario context that's actually engaging
  • A patient opening line in their voice
  • Slightly richer task descriptions

This pass is purely additive — it never changes the medical content. The audit pipeline re-checks plausibility after enrichment.
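The "purely additive" guarantee is easy to enforce mechanically: enrichment may only introduce new fields, never overwrite existing ones. A sketch with hypothetical field names:

```python
# Illustrative medical fields the enrichment pass must never touch.
MEDICAL_FIELDS = ("presenting_concern", "medications", "clinical_context")

def enrich(case: dict, additions: dict) -> dict:
    """Apply enrichment additively: new keys only, medical content
    guaranteed unchanged. Field names are illustrative."""
    overlap = set(additions) & set(case)
    if overlap:
        raise ValueError(f"enrichment may not overwrite: {sorted(overlap)}")
    enriched = {**case, **additions}
    # Belt-and-braces: medical fields must be byte-identical.
    assert all(enriched[f] == case[f] for f in MEDICAL_FIELDS if f in case)
    return enriched
```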

After v2.2, the bank stands at 11,000+ cases, distributed across the 12 OET professions roughly in proportion to the candidate volume each profession sees on the exam.

What we'd do differently

Three things, in priority order:

  1. Start with the audit, not the generation. A working auditor is more valuable than a fancier generator. Get the audit checks rock solid; then iterate generation against them.
  2. Profession schemas need expert input. We initially derived schemas from public OET sample material. For some professions (veterinary, podiatry) the public material is thin and we ended up under-covering common scenarios. Better: collaborate with a working clinician in each profession to validate the schema.
  3. Audit re-runs should be cheap. Every prompt change to the auditor means re-auditing the whole bank. Make the re-audit pipeline a one-command operation. Ours wasn't initially and we paid for it in elapsed time.
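The re-audit we wish we'd built from day one is just a loop over the bank that returns the failures — something small enough to wire to a single CLI entry point. A minimal sketch (the `audit_case` callable stands in for the auditor model):

```python
def reaudit(cases: list, audit_case) -> dict:
    """Re-run the auditor over every case and return only the failures,
    keyed by case id. Wrap this in a CLI entry point so every auditor
    prompt change is a one-command re-run over the whole bank."""
    results = {c["id"]: audit_case(c) for c in cases}
    return {cid: r for cid, r in results.items() if not r["passed"]}
```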

What the bank looks like in production

When a candidate hits "Recommended" in OET Live, a weighted recommender picks a case from the bank based on:

  • Your two weakest criteria (50% weight)
  • Topics you haven't tried in the last 30 days (30% weight)
  • Topics where your previous attempts scored below your overall average (20% weight)

The result is that the case you get is engineered to move your weakest band — not random.
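The 50/30/20 weighting can be sketched as a scoring function. This is a deliberately coarse sketch assuming each factor contributes its full weight when it applies; the production recommender is presumably more granular:

```python
def score_case(case: dict, candidate: dict) -> float:
    """Score one bank case for one candidate using the 50/30/20
    weighting. Field names are illustrative."""
    score = 0.0
    # 50%: case targets one of the candidate's two weakest criteria
    if set(candidate["weakest_criteria"][:2]) & set(case["criteria_profile"]):
        score += 0.5
    # 30%: topic not attempted in the last 30 days
    if case["topic"] not in candidate["topics_last_30_days"]:
        score += 0.3
    # 20%: topic where past attempts scored below the candidate's average
    if case["topic"] in candidate["below_average_topics"]:
        score += 0.2
    return score

def recommend(bank: list, candidate: dict) -> dict:
    """Pick the highest-scoring case in the bank."""
    return max(bank, key=lambda c: score_case(c, candidate))
```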

Future passes

The next case-bank work:

  • Per-country cultural register — some scenarios should land differently depending on whether the candidate is registering with the UK NMC vs AHPRA. Right now we average across regions.
  • Recency cycling — we want to retire cases that get over-practiced and refresh the bank quarterly.
  • Calibration sets — a subset of cases scored by examiner-equivalent humans to validate the auto-scoring continues to track real OET scoring.

If you want to practice on the current bank, join the waitlist.
