Captioning, alt text, and the limits of automation
Automated captions and image descriptions have come a long way. They are not yet a substitute for a person who has read the room.
A teacher I admire tells the same story every term. She uploads a lecture video to the university platform. The platform’s auto-captioner runs overnight. The next morning, every instance of “neural network” has become “neutral network.” Every “bias” has become “bayes.” A student reads along, and the lesson quietly drifts out from under them.
This is not a story about a bad tool. The auto-captioner is, by most measures, very good. It is a story about what happens when “good enough on average” meets the specific human in front of the screen.
What automated transcription is good at
Modern speech-to-text systems are remarkable. They handle accents, fast speech, and noisy rooms in ways that would have been science fiction a decade ago. For everyday meetings, lecture notes, and personal voice memos, they work well enough that most users no longer think about them.
For accessibility — for someone whose primary access to spoken content is through reading — “well enough” carries a different cost. A 95% word-accurate transcript may still mangle the one technical term the entire lesson hinges on.
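The arithmetic deserves to be made concrete. Here is a back-of-the-envelope sketch in Python, assuming a 50-minute lecture at roughly 150 words per minute; both figures are illustrative, not measurements of any particular system:

```python
# Back-of-the-envelope: what 95% word accuracy means for one lecture.
# Assumed figures (not from any real benchmark): 50 minutes, ~150 wpm.
words_per_minute = 150
lecture_minutes = 50
word_accuracy = 0.95

total_words = words_per_minute * lecture_minutes       # ~7,500 words
expected_errors = total_words * (1 - word_accuracy)    # ~375 wrong words

print(f"{total_words} words, roughly {expected_errors:.0f} of them wrong")
```

And those few hundred errors are not spread evenly. They cluster on exactly the words the recognizer has seen least, which tend to be the terms a lesson hinges on.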
What it is still not good at
Three things consistently trip up automated systems in 2026:
- Domain vocabulary that overlaps with common words. “Bias” in machine learning, “vector” in graphics, “agent” in software design.
- Speaker turns in informal conversation. Two friends talking over each other become one confused monologue.
- Non-speech audio that carries meaning. A door slamming. A laugh. A long silence.
The same pattern holds for image descriptions. A model can describe a photo as “a group of people standing in a room.” A human describes it as “the team gathered around the prototype, looking at it for the first time.”
A workable middle path
The teacher I mentioned now uses the auto-caption as a draft and spends thirty minutes correcting it before publishing. Thirty minutes per lecture is a real cost; it is also less than the cost of redoing the entire lesson.
This is the pattern we recommend, broadly: treat the machine output as a first pass. Build review into the workflow. Pay people to do the part that requires having read the room.
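For readers who maintain these workflows, here is a minimal sketch of what building review in can look like. The glossary entries, the sample draft, and the function name are all hypothetical; the deliberate choice is that the script flags rather than auto-corrects, because "bayes" really is the right word in some lectures, and only a person who has read the room can tell.

```python
import re

# Known confusions for this course; hypothetical entries, not a standard list.
GLOSSARY = {
    "neutral network": "neural network",
    "bayes": "bias",  # sometimes correct as-is, so a human must decide
}

def flag_lines(caption_lines):
    """Yield (line_number, text, note) for caption lines a reviewer should check."""
    for num, text in enumerate(caption_lines, start=1):
        for heard, likely in GLOSSARY.items():
            if re.search(rf"\b{re.escape(heard)}\b", text, re.IGNORECASE):
                yield num, text, f'did the speaker mean "{likely}"?'

# A stand-in for an auto-captioned draft.
draft = [
    "Today we train a neutral network on the housing data.",
    "Watch how the bayes term shifts the decision boundary.",
]

for num, text, note in flag_lines(draft):
    print(f"line {num}: {note}\n    {text}")
```

The machine narrows where the reviewer looks; the reviewer still makes the call.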
Automation reduces the work. It does not yet replace the care.