Neshise Insights / accessibility

Captioning, alt text, and the limits of automation

Automated captions and image descriptions have come a long way, but they are not yet a substitute for a person who has read the room.

By Neshise

A teacher I admire tells the same story every term. She uploads a lecture video to the university platform. The platform’s auto-captioner runs overnight. The next morning, every instance of “neural network” has become “neutral network.” Every “bias” has become “bayes.” A student reads along, and the lesson quietly drifts out from under them.

This is not a story about a bad tool. The auto-captioner is, by most measures, very good. It is a story about what happens when “good enough on average” meets the specific human in front of the screen.

What automated transcription is good at

Modern speech-to-text systems are remarkable. They handle accents, overlapping speech, and noisy rooms in ways that would have been science fiction a decade ago. For everyday meetings, lecture notes, and personal voice memos, they work well enough that most users no longer think about them.

For accessibility — for someone whose primary access to spoken content is through reading — “well enough” carries a different cost. A 95% word-accurate transcript may still mangle the one technical term the entire lesson hinges on.
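To make that cost concrete, here is a back-of-envelope sketch. The lecture length and speaking rate are illustrative assumptions, not measurements from the story above:

```python
# Back-of-envelope: how many wrong words a "95% word-accurate"
# transcript still contains. Figures are illustrative assumptions.
def expected_errors(word_count: int, word_accuracy: float) -> int:
    """Expected number of misrecognized words at a given accuracy."""
    return round(word_count * (1.0 - word_accuracy))

# A 50-minute lecture at roughly 130 spoken words per minute:
words = 50 * 130                     # 6,500 words
print(expected_errors(words, 0.95))  # → 325 wrong words per lecture
```

Three hundred-odd errors spread evenly would be noise; clustered on the handful of load-bearing terms, they are the lesson.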

What it is still not good at

Three things consistently trip up automated systems in 2026:

  1. Domain vocabulary that overlaps with common words. “Bias” in machine learning, “vector” in graphics, “agent” in software design.
  2. Speaker turns in informal conversation. Two friends talking over each other become one confused monologue.
  3. Non-speech audio that carries meaning. A door slamming. A laugh. A long silence.

The same pattern holds for image descriptions. A model can describe a photo as “a group of people standing in a room.” A human describes it as “the team gathered around the prototype, looking at it for the first time.”

A workable middle path

The teacher I mentioned now uses the auto-caption as a draft and spends thirty minutes correcting it before publishing. Thirty minutes per lecture is a real cost; it is also less than the cost of redoing the entire lesson.

This is the pattern we recommend, broadly: treat the machine output as a first pass. Build review into the workflow. Pay people to do the part that requires having read the room.
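One way to make the first pass cheaper is a small glossary step before the human review. A hypothetical sketch, assuming the course keeps a list of terms the captioner is known to mangle; the substitution pairs come from the lecture story above, and the function name is ours:

```python
import re

# Hypothetical domain glossary: known mis-captionings -> correct terms.
# These pairs come from the lecture example; extend per course.
GLOSSARY = {
    "neutral network": "neural network",
    "bayes": "bias",
}

def glossary_pass(transcript: str, glossary: dict[str, str]) -> str:
    """Apply known corrections to a draft transcript, case-insensitively.

    This is a pre-pass only; a human still reviews the result, since
    blind substitution can itself introduce errors (a lecture that
    really is about Bayes would be mangled in the other direction).
    """
    for wrong, right in glossary.items():
        transcript = re.sub(re.escape(wrong), right, transcript,
                            flags=re.IGNORECASE)
    return transcript

draft = "The neutral network learns a bayes term during training."
print(glossary_pass(draft, GLOSSARY))
# → The neural network learns a bias term during training.
```

The point of the sketch is the division of labor: the machine handles the predictable, repeated errors, and the thirty human minutes go to the ones no glossary anticipates.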

Automation reduces the work. It does not yet replace the care.