On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program.  But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
>
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
>
> Oh, damn.  I wasn't expecting that.  Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]

How embarrassing.  Could somebody please file a bug on
https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?  (Of
course there'll be a lead time for fixes to get into distributions.)

> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while.  Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
>
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output.  (This is
> similar to an m4 feature known as the "black hole diversion".)
>
> All of the features necessary to implement this[2] were part of troff as
> far back as the birth of the man(7) package itself.  It's not clear
> to me why it wasn't done back in the 1980s.
>
> lexgrog(1) itself will of course have to stay around for years to come,
> but this could take a significant distraction off of Colin's plate--I
> believe I have seen him grumble about how much *roff syntax he has to
> parse to have the feature be workable, and that's without upstart groff
> maintainers exploring up to every boundary that existed even in 1979 and
> cheerfully exercising their findings in man pages.

lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
you focus on that then you'll end up with a design that's not very
useful.  What really matters is indexing the whole system's manual
pages, and mandb(8) does not do that by invoking lexgrog(1) one page at
a time, but rather by running more or less the same code in-process.

I already know that getting acceptable performance for this requires
care, as illustrated by one of the NEWS entries for man-db 2.10.0:

 * Significantly improve `mandb(8)` and `man -K` performance in the
   common case where pages are of moderate size and compressed using
   `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
   system.

... so I'm prepared to bet that forking nroff one page at a time will be
unacceptably slow.  (This also combines with the fact that man-db
applies some sandboxing when it's calling nroff just in case it might
happen that a moderately-sized C++ project has less than 100% perfect
security when doing text processing, which I'm sure everyone agrees
would never happen.)

If it were possible to run nroff over a whole batch of pages and get
output for each of them in one go, then maaaaybe.  man-db would need a
reliable way to associate each line (or sometimes multiple lines) of
output with each source file, and of course care would be needed around
error handling and so on.

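To make the association problem concrete, here is a minimal sketch of
the consumer side only.  It assumes a purely hypothetical framing in
which the batch run prints a line of the form "@@PAGE <path>" before
the output for each source file; nothing in nroff or man-db emits such
a marker today, and the names MARKER, split_batch_output and
failed_pages are invented for illustration.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical framing (an assumption, not an existing nroff or man-db
# feature): one "@@PAGE <path>" line precedes each file's output.
MARKER = "@@PAGE "


def split_batch_output(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the lines of one combined output stream by source file."""
    grouped: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(MARKER):
            current = line[len(MARKER):]
            if current in grouped:
                raise ValueError(f"duplicate marker for {current}")
            grouped[current] = []
        elif current is None:
            # Non-blank output that cannot be attributed to any page.
            if line.strip():
                raise ValueError(f"output before first marker: {line!r}")
        else:
            grouped[current].append(line)
    return grouped


def failed_pages(grouped: Dict[str, List[str]]) -> List[str]:
    """Pages with no non-blank output, treated here as parse failures."""
    return [
        path
        for path, out in grouped.items()
        if not any(text.strip() for text in out)
    ]
```

Even this toy version shows where the care is needed: a page whose own
text happens to start a line with the marker would be misattributed, and
an empty result for a page says nothing about why it failed.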
I can see the appeal, in terms of processing the actual language rather
than a pile of hacks that try to guess what to do with it - but on the
other hand this starts to feel like a much less natural fit for the way
nroff is run in every other situation, where you're processing one
document at a time.

Cheers,

-- 
Colin Watson (he/him)                              [cjwatson@xxxxxxxxxx]