making YouTube (etc) videos more accessible?

Rich Morin <rdm@xxxxxxxx> · Sun, 28 Jul 2024 10:37:27 -0700

The discussion of scanning and converting PDFs reminded me of a topic I've thought about a number of times.  Perhaps some folks here will be interested...

I watch a lot of conference presentations on YouTube (etc.). Typically, these have been edited into a collage showing the speaker, a display screen, and perhaps some graphics for the event. The display screen generally shows styled text, bullet points, charts, and other graphic images.

Although a blind person can listen to the audio track, they will miss all of the visual content.  So, I've wondered what the prospects might be for improving this situation.  For example, it seems like it should be possible for software to:

- pull static images from the video stream
- recognize the region containing the display screen
- extract text and layout information
- convert this to HTML
- synchronize the HTML to the audio track
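
To make the last two steps a bit more concrete, here is a rough Python sketch. It assumes the earlier steps (frame sampling, screen detection, OCR) have already produced a list of text lines per sampled frame. The names FrameText, collapse_to_slides, and slides_to_html are my own inventions for illustration, as is the idea of treating the first line of each slide as its title. It is not based on any existing tool.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class FrameText:
    """Text recovered from one sampled frame (hypothetical OCR output)."""
    seconds: float  # position of the frame on the audio/video timeline
    lines: list     # text lines found in the display-screen region


def collapse_to_slides(frames):
    """Merge consecutive frames that show identical text into one slide."""
    slides = []
    for frame in frames:
        if slides and slides[-1].lines == frame.lines:
            continue  # same slide still on screen; keep the earliest time
        slides.append(frame)
    return slides


def slides_to_html(slides):
    """Render each slide as a <section> tagged with its start time."""
    parts = []
    for slide in slides:
        if not slide.lines:
            continue  # nothing readable in this frame
        # Assumption: the first line is the slide title, the rest are bullets.
        title, *bullets = slide.lines
        items = "".join(f"<li>{escape(b)}</li>" for b in bullets)
        body = f"<ul>{items}</ul>" if bullets else ""
        parts.append(
            f'<section data-start="{slide.seconds:.1f}">'
            f"<h2>{escape(title)}</h2>{body}</section>"
        )
    return "\n".join(parts)
```

A player script (or a screen reader plugin) could then use the data-start values to announce or reveal each section when the audio reaches that time. Real OCR output would be noisier than this, so the exact-match comparison in collapse_to_slides would probably need to become a fuzzy one.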

Or, in this age of LLMs and such, perhaps the software could analyze the visual content and be prepared to discuss it interactively. Might anyone know of any work in this area and/or have thoughts about how such a facility "should" work?

-r
