Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

Colin Watson <cjwatson@xxxxxxxxxx> · Sat, 2 Nov 2024 19:06:53 +0000

On Sat, Nov 02, 2024 at 05:08:37AM -0500, G. Branden Robinson wrote:
> On GNU/Linux systems, the only man page indexer I know of is Colin
> Watson's man-db--specifically, its mandb(8) program.  But it's nicely
> designed so that the "topic and summary description extraction" task is
> delegated to a standalone tool, lexgrog(1), and we can use that.
> 
> $ lexgrog /tmp/proc_pid_fdinfo_mini.5
> /tmp/proc_pid_fdinfo_mini.5: parse failed
> 
> Oh, damn.  I wasn't expecting that.  Maybe this is what defeats Michael
> Kerrisk's scraper with respect to groff's man pages.[1]

How embarrassing.  Could somebody please file a bug on
https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?  (Of
course there'll be a lead time for fixes to get into distributions.)

> Well, I can find a silver lining here, because it gives me an even
> better reason than I had to pitch an idea I've been kicking around for a
> while.  Why not enhance groff man(7) to support a mode where _it_ will
> spit out the "Name"/"NAME" section, and only that, _for_ you?
> 
> This would be as easy as checking for an option, say '-d EXTRACT=Name',
> and having the package's "TH" and "SH" macro definitions divert
> (literally, with the `di` request) everything _except_ the section of
> interest to a diversion that is then never called/output.  (This is
> similar to an m4 feature known as the "black hole diversion".)
> 
> All of the features necessary to implement this[2] were part of troff as
> far back as the birth of the man(7) package itself.  It's not clear
> to me why it wasn't done back in the 1980s.
> 
> lexgrog(1) itself will of course have to stay around for years to come,
> but this could take a significant distraction off of Colin's plate--I
> believe I have seen him grumble about how much *roff syntax he has to
> parse to have the feature be workable, and that's without upstart groff
> maintainers exploring up to every boundary that existed even in 1979 and
> cheerfully exercising their findings in man pages.

lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
you focus on that then you'll end up with a design that's not very
useful.  What really matters is indexing the whole system's manual
pages, and mandb(8) does not do that by invoking lexgrog(1) one page at
a time, but rather by running more or less the same code in-process.  I
already know that getting acceptable performance for this requires care,
as illustrated by one of the NEWS entries for man-db 2.10.0:

 * Significantly improve `mandb(8)` and `man -K` performance in the common
   case where pages are of moderate size and compressed using `zlib`: `mandb
   -c` goes from 344 seconds to 10 seconds on a test system.

... so I'm prepared to bet that forking nroff one page at a time will be
unacceptably slow.  (This also combines with the fact that man-db
applies some sandboxing when it's calling nroff just in case it might
happen that a moderately-sized C++ project has less than 100% perfect
security when doing text processing, which I'm sure everyone agrees
would never happen.)
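
For a sense of what "the same code in-process" has to guess at, here is
a deliberately naive sketch in Python.  It is not man-db's actual code
(which is C and handles far more cases); the function name and the
inputs are made up for illustration.  It assumes the conventional
layout of an `.SH NAME` heading followed by text of the form
`name1, name2 \- summary`, and gives up on anything else.

```python
import re

# Matches a section heading such as: .SH NAME   or   .SH "Name"
SECTION_RE = re.compile(r'^\.\s*SH\s+"?([^"]*)"?\s*$')


def extract_name_section(source):
    """Return (names, summary) guessed from man(7) source, or None.

    Hypothetical helper: it only understands plain text lines inside
    the NAME section and ignores every macro or request line.
    """
    in_name = False
    collected = []
    for line in source.splitlines():
        heading = SECTION_RE.match(line)
        if heading:
            if in_name:
                break  # the next section has started, so NAME is over
            in_name = heading.group(1).strip().lower() == "name"
            continue
        # Skip request/macro lines; a real parser cannot afford to.
        if in_name and not line.startswith("."):
            collected.append(line.strip())
    text = " ".join(collected)
    # Conventional separator between the names and the summary.
    names, sep, summary = text.partition(r" \- ")
    if not sep:
        return None
    return [n.strip() for n in names.split(",")], summary.strip()


simple = ".TH FOO 1\n.SH NAME\nfoo, bar \\- do things\n.SH SYNOPSIS\nfoo\n"
with_macros = ".TH FOO 1\n.SH NAME\n.B foo\n\\- do things\n.SH SYNOPSIS\n"

print(extract_name_section(simple))
print(extract_name_section(with_macros))
```

The second input is a made-up page that sets its name with a macro; the
sketch cannot see the name at all and returns None, which is the kind
of gap that a parser working on the source text has to keep patching.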

If it were possible to run nroff over a whole batch of pages and get
output for each of them in one go, then maaaaybe.  man-db would need a
reliable way to associate each line (or sometimes multiple lines) of
output with each source file, and of course care would be needed around
error handling and so on.  I can see the appeal, in terms of processing
the actual language rather than a pile of hacks that try to guess what
to do with it - but on the other hand this starts to feel like a much
less natural fit for the way nroff is run in every other situation,
where you're processing one document at a time.
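
To make the "associate each line of output with each source file"
requirement concrete, here is a rough sketch in Python of the consuming
side.  Everything about the protocol is an assumption: no such batch
mode exists in nroff today, and the marker byte and function name are
invented.  It assumes the formatter would emit a marker line carrying
the source path before each page's output.

```python
# Hypothetical: a batch mode prints this byte plus the source path on a
# line of its own before the output for each page.
MARKER = "\x1e"


def split_batch_output(lines):
    """Group output lines by the source file named in the last marker.

    Returns a dict mapping path -> list of output lines.  A path with
    an empty list produced no output, which a caller would presumably
    treat as a failure for that page.
    """
    per_file = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(MARKER):
            current = line[len(MARKER):]
            per_file.setdefault(current, [])
        elif current is None:
            # Output that cannot be attributed to any page.
            raise ValueError(f"output before any file marker: {line!r}")
        else:
            per_file[current].append(line)
    return per_file


sample = [
    "\x1e/man/man1/foo.1\n",
    "foo \\- do things\n",
    "\x1e/man/man1/bar.1\n",
    "bar \\- first line\n",
    "continued\n",
    "\x1e/man/man1/broken.1\n",
]
result = split_batch_output(sample)
for path, out in result.items():
    print(path, out)
```

Even in this toy form, the error handling question shows up at once: a
page with no output and a page whose output got attributed to its
neighbour look very different to the indexer, and only the first is
detectable here.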

Cheers,

-- 
Colin Watson (he/him)                              [cjwatson@xxxxxxxxxx]