Inside our pronunciation feedback pipeline
How we score OET pronunciation — signals used, where engines disagree, and why we keep more than one running at the same time.
Pronunciation scoring is one of the higher-stakes parts of any OET feedback report. Get it right and a candidate has actionable detail to drill: "the word prophylaxis scored 38/100, and your stress was on the wrong syllable". Get it wrong and the candidate starts doubting the whole report.
When we set out to build word-level pronunciation feedback into OET Live, we deliberately built the pipeline around a few design rules. This post walks through what those are, what they produce, and where the honest limits live.
What we score, what we don't
We score:
- Accuracy at the syllable level — did the candidate pronounce each syllable correctly?
- Completeness at the word level — was each word fully produced, or partially elided?
- Stress at the word level — for multi-syllable words, was the stress on the right syllable?
- Rhythm and fluency at the sentence level — were tempo and pausing natural?
We don't score:
- Accent reduction. OET assesses intelligibility, not "do you sound American/British". We deliberately don't penalise accent-typical substitutions that don't affect intelligibility.
- Conversational fillers ("um", "uh") as pronunciation errors. They're scored separately as fluency signals.
- Words the model is unsure it heard. Below a confidence threshold, the word is excluded from the report rather than guessed at (a sketch of this filter follows the list).
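To make that last rule concrete, here is a minimal sketch of the confidence filter. The `Word` shape, the field names, and the 0.6 cut-off are illustrative, not our production values:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    score: int             # 0-100 pronunciation score
    asr_confidence: float  # engine's confidence it heard this word (0.0-1.0)

MIN_CONFIDENCE = 0.6  # hypothetical cut-off; the real threshold is tuned per engine

def reportable(words: list[Word]) -> list[Word]:
    """Keep only words the engine is confident it heard; below the
    threshold we exclude the word rather than guess."""
    return [w for w in words if w.asr_confidence >= MIN_CONFIDENCE]
```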
Why we run more than one engine
We run every candidate recording through two independent pronunciation analysis services in parallel. Both are best-in-class. Both give per-word and per-syllable scores. They agree on about 78% of the words they flag as problematic, and the remaining 22% is where the interesting decisions live.
Running both lets us:
- Cross-validate. If both engines agree a word was mispronounced, the candidate sees high confidence in the flag. If only one flags it, we mark it as low-confidence in the report (sketched after this list).
- Combine strengths. One engine is stricter on multi-syllable clinical jargon. The other is better at prosody (rhythm + stress). We use the strict one for word-level accuracy, the prosody one for fluency feedback.
- Stay vendor-resilient. Speech APIs evolve. Pricing changes. By design we can swap either engine with a config flag if a better option appears.
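As a rough sketch, the cross-validation step in the first bullet reduces to a set merge; the shapes here are simplified for illustration:

```python
from enum import Enum

class FlagConfidence(Enum):
    HIGH = "high"  # both engines flagged the word as mispronounced
    LOW = "low"    # only one engine flagged it

def merge_flags(flags_a: set[str], flags_b: set[str]) -> dict[str, FlagConfidence]:
    """Cross-validate per-word mispronunciation flags from the two engines."""
    return {
        word: FlagConfidence.HIGH if word in flags_a & flags_b else FlagConfidence.LOW
        for word in flags_a | flags_b
    }

# merge_flags({"prophylaxis", "compliance"}, {"prophylaxis"})
# -> {"prophylaxis": HIGH, "compliance": LOW}
```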
What the engines disagree on
Three repeated patterns:
1. Multi-syllable clinical jargon
Words like "hypertension", "contraindication", "prophylaxis", "compliance": one engine is stricter, the other gives partial credit if most syllables land. For OET specifically, an examiner cares about getting the whole word right, so we lean on the stricter score for clinical vocabulary.
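In code, that routing decision is roughly the sketch below. The lexicon is a toy subset, and the simple average for non-clinical words is a placeholder rather than our actual blend:

```python
# Illustrative subset; the real clinical lexicon is much larger.
CLINICAL_TERMS = {"hypertension", "contraindication", "prophylaxis", "compliance"}

def word_accuracy(word: str, strict_score: int, lenient_score: int) -> int:
    """Lean on the stricter engine for clinical vocabulary; blend the
    two engines (placeholder: simple average) for everything else."""
    if word.lower() in CLINICAL_TERMS:
        return strict_score
    return (strict_score + lenient_score) // 2
```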
2. Connected speech reductions
Real speech connects words ("did you" → "didja"). One engine flags this as a deviation; the other tolerates it. The question is whether the reduction affects intelligibility. Most native English speakers make the same reduction, so we usually tolerate it and don't surface it as a "mistake" in the report.
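One way to encode that tolerance is an allow-list of well-documented reductions. The table and the word-pair matching below are simplified illustrations; a production version would more plausibly work on phoneme sequences:

```python
# Connected-speech reductions treated as natural, not as errors.
KNOWN_REDUCTIONS = {
    ("did", "you"): "didja",
    ("going", "to"): "gonna",
    ("want", "to"): "wanna",
}

def is_tolerated_reduction(first: str, second: str, heard: str) -> bool:
    """True if what the engine heard matches a known natural reduction
    of the two expected words."""
    return KNOWN_REDUCTIONS.get((first.lower(), second.lower())) == heard.lower()
```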
3. Accent-typical substitutions
Some accent-typical substitutions (Filipino [p]/[f], Indian [w]/[v], Brazilian rhotic patterns) are flagged by one engine and tolerated by the other. The OET intelligibility criterion doesn't penalise accent — it penalises unintelligibility. We weight the stricter score downward when the substitution is well-documented as accent-typical and doesn't cross into "you can't tell what the word is."
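Sketched in code, the down-weighting might look like this; both the substitution table and the damping factor are illustrative assumptions:

```python
# Accent-typical (expected, produced) phoneme pairs that rarely harm intelligibility.
ACCENT_TYPICAL_SUBS = {("f", "p"), ("p", "f"), ("v", "w"), ("w", "v")}

DAMPING = 0.3  # hypothetical factor: accent is not an error, but gibberish still scores low

def adjusted_penalty(expected: str, produced: str, raw_penalty: float) -> float:
    """Damp the strict engine's penalty when the substitution is a
    well-documented accent-typical pattern."""
    if (expected, produced) in ACCENT_TYPICAL_SUBS:
        return raw_penalty * DAMPING
    return raw_penalty
```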
What we ship in the report
When you finish a role-play, the pronunciation section of the feedback report shows:
- A per-word "drill this" list, ranked by how much each word would cost you with an examiner (top 5 by default; a ranking sketch follows this list)
- For each word: the syllable flagged as mispronounced, the correct stress pattern, and an audio replay clip
- A per-criterion breakdown of how much pronunciation contributed to your overall intelligibility and fluency bands
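In spirit, the ranking behind the "drill this" list looks like the sketch below. How we estimate examiner impact is more involved in practice; the severity-times-clinical-weight heuristic here is only an illustration:

```python
from dataclasses import dataclass

@dataclass
class ScoredWord:
    text: str
    score: int      # 0-100; lower means worse pronunciation
    clinical: bool  # clinical vocabulary weighs more with an examiner

def drill_list(words: list[ScoredWord], top_n: int = 5) -> list[ScoredWord]:
    """Rank words by estimated examiner impact: severity (100 - score),
    doubled for clinical terms (the 2x weight is illustrative)."""
    def impact(w: ScoredWord) -> float:
        return (100 - w.score) * (2.0 if w.clinical else 1.0)
    return sorted(words, key=impact, reverse=True)[:top_n]
```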
This is the part of the report most candidates tell us they open first. That's intentional — pronunciation is the most actionable single thing you can drill between role-plays.
The honest limits
Two things our pipeline doesn't do yet:
- It doesn't account for prosody at the discourse level. Stress at the word level is solved; stress patterns across a 3-minute consultation (which words you emphasise to convey meaning) are not. That's a harder problem, and we're working on it.
- It doesn't model listener effort. Two recordings with identical syllable-level scores can feel different to a listener — one harder to follow than the other. Some of that is captured by fluency scoring; some of it isn't.
What's next
We re-evaluate the pipeline every few months. The two-engine setup has held up well; we've added the prosody score to the primary report rather than burying it as a secondary signal. The next big change will likely come when multimodal AI matches the accuracy of dedicated speech APIs at lower cost, probably 12-18 months out. When that arrives, we'll write about it here.
If you want to see what word-level pronunciation feedback looks like in practice, join the waitlist.
Further reading
- How OET Speaking is actually scored — the broader rubric context
- Building an 11,000-case bank — sister methodology post on how cases are made