Hi Colin,

At 2024-11-02T19:06:53+0000, Colin Watson wrote:
> How embarrassing. Could somebody please file a bug on
> https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?

Done; <https://gitlab.com/man-db/man-db/-/issues/46>.

> lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
> you focus on that then you'll end up with a design that's not very
> useful. What really matters is indexing the whole system's manual
> pages, and mandb(8) does not do that by invoking lexgrog(1) one page
> at a time, but rather by running more or less the same code
> in-process.

Ah, I see it now--"lexgrog.l" is in both the Automake macros
"lexgrog_SOURCES" and "mandb_SOURCES". Nice and DRY!

> I already know that getting acceptable performance for
> this requires care, as illustrated by one of the NEWS entries for
> man-db 2.10.0:
>
> * Significantly improve `mandb(8)` and `man -K` performance in the
>   common case where pages are of moderate size and compressed using
>   `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
>   system.
>
> ... so I'm prepared to bet that forking nroff one page at a time will
> be unacceptably slow.

Probably, but there is little reason to run nroff that way (as of groff
1.23). It already works well, but I have ideas for further hardening
groff's man(7) and mdoc(7) packages such that they return to a
well-defined state when changing input documents.

> (This also combines with the fact that man-db applies some sandboxing
> when it's calling nroff just in case it might happen that a
> moderately-sized C++ project has less than 100% perfect security when
> doing text processing, which I'm sure everyone agrees would never
> happen.)

Inconceivable, yes! But fortunately you can run nroff over N documents
and pay its own startup overhead costs as well as those of sandboxing
only once.

> If it were possible to run nroff over a whole batch of pages and get
> output for each of them in one go, then maaaaybe.

That's already true for formatting the entire page. It's how this was
created.

https://www.gnu.org/software/groff/manual/groff-man-pages.utf8.txt

(...best viewed with "less -R")

With the `-d EXTRACT` feature I have in mind, in its
as-simple-as-possible first-cut form, the problem you anticipate...

> man-db would need a reliable way to associate each line (or sometimes
> multiple lines) of output with each source file,

...would remain. I'll have to think of a good way to write out
"metadata" (the input file name and the arguments to the `TH` request)
as each page is encountered, and of an interface to enable that. I
don't see it happening before groff 1.25.

> and of course care would be needed around error handling and so on.

I need to give this thought, too. What sorts of error scenarios do you
foresee? GNU troff itself, if it can't open a file to be formatted,
reports an error diagnostic and continues to the next `argv` string
until it reaches the end of input.

> I can see the appeal, in terms of processing the actual language
> rather than a pile of hacks that try to guess what to do with it

...a major selling point, IMO...

> but on the other hand this starts to feel like a much less natural fit
> for the way nroff is run in every other situation, where you're
> processing one document at a time.

This I disagree with. Or perhaps more precisely, it's another example
of the exception (man(1)) swallowing the rule (nroff/troff).

nroff and troff were written as Unix filters; they read the standard
input stream (and/or argument list)[1], do some processing, and write to
standard output.[2]

Historically, troff (or one of its preprocessors) was commonly used with
multiple input files to catenate them. Here's an example of this
practice from 1980.

https://minnie.tuhs.org/cgi-bin/utree.pl?file=3BSD/usr/doc/pascal/makefile

Regards,
Branden

[1] ...including this option from Seventh Edition Unix (1979) or
    earlier, which survives in GNU troff to this day.

    -i      Read standard input after the input files are exhausted.

[2] Seventh Edition troff didn't write to stdout by default, but tried
    to open the typesetter device. But it had an option to write to
    standard output.

    -t      Direct output to the standard output instead of the
            phototypesetter.

    Running old school Unix under emulation these days, you _have_ to
    use this option to avoid the dreaded "Typesetter busy." diagnostic.

    When Kernighan refactored troff for device-independence, he
    reseated it more squarely in the Unix filter tradition by writing
    its plain-text page description language to stdout. The output
    driver, such as "dpost" for PostScript, also read its standard
    input, and could thus become just one more stage in a pipeline.
    [CSTR #97]
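
A minimal sketch, in Python, of the batching idea discussed above: hand nroff many pages as ordinary file operands so that its startup cost (and any sandboxing cost) is paid once per batch rather than once per page. The function names and the batch size are hypothetical, not anything man-db or groff provides, and the sketch assumes every page is a man(7) document so that `-man` suffices. It makes no attempt to map output back to individual source files, which the message above notes is still an open problem.

```python
import subprocess
from typing import Iterable, List


def build_nroff_command(pages: Iterable[str]) -> List[str]:
    """Build one nroff invocation that formats every page in `pages`.

    Only `-man` (load the man macro package) is passed; the input
    files follow as plain operands, as with any Unix filter.
    """
    return ["nroff", "-man", *pages]


def chunked(pages: List[str], size: int) -> List[List[str]]:
    """Split `pages` into batches of at most `size` entries.

    `size` is a hypothetical knob, e.g. to stay under the system's
    argument-length limit.
    """
    if size < 1:
        raise ValueError("size must be positive")
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def format_batch(pages: List[str]) -> str:
    """Run nroff once over the whole batch and return its output.

    Needs nroff installed and real page files. check=False because
    the caller is assumed to inspect diagnostics itself; the message
    above says GNU troff reports unopenable files and carries on.
    """
    result = subprocess.run(
        build_nroff_command(pages),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

With this shape, a caller would loop over `chunked(all_pages, n)` and call `format_batch` once per chunk, so the number of nroff processes scales with the number of batches rather than the number of pages.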