On Sat, Nov 02, 2024 at 07:50:23PM -0500, G. Branden Robinson wrote:
> At 2024-11-02T19:06:53+0000, Colin Watson wrote:
> > How embarrassing. Could somebody please file a bug on
> > https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?
>
> Done; <https://gitlab.com/man-db/man-db/-/issues/46>.

Thanks, working on it.

> > I already know that getting acceptable performance for
> > this requires care, as illustrated by one of the NEWS entries for
> > man-db 2.10.0:
> >
> >  * Significantly improve `mandb(8)` and `man -K` performance in the
> >    common case where pages are of moderate size and compressed using
> >    `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
> >    system.
> >
> > ... so I'm prepared to bet that forking nroff one page at a time will
> > be unacceptably slow.
>
> Probably, but there is little reason to run nroff that way (as of groff
> 1.23). It already works well, but I have ideas for further hardening
> groff's man(7) and mdoc(7) packages such that they return to a
> well-defined state when changing input documents.

Being able to keep track of which output goes with which input pages is
critical to the indexer, though (as you acknowledge later in your
reply). It can't just throw the whole lot at nroff and call it a day.

One other thing: mandb/lexgrog also looks for preprocessing filter hints
in pages (`'\" te` and the like). This is obscure, to be sure, but
either a replacement would need to do the same thing or we'd need to be
certain that it's no longer required.

> > and of course care would be needed around error handling and so on.
>
> I need to give this thought, too. What sorts of error scenarios do you
> foresee? GNU troff itself, if it can't open a file to be formatted,
> reports an error diagnostic and continues to the next `argv` string
> until it reaches the end of input.

That might be sufficient, or man-db might need to be able to detect
which pages had errors. I'm not currently sure.

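To illustrate the preprocessing filter hints mentioned above, here is a
minimal Python sketch of the kind of check a replacement would need. It
is not man-db's actual implementation (which is written in C): the
function name `preprocessor_hints` and the regular expression are
assumptions made for illustration, and the letter table follows the one
documented for man(1)'s `-p` option.

```python
import re

# Letter-to-preprocessor table, as documented for man(1)'s -p option.
PREPROCESSORS = {
    "e": "eqn",
    "g": "grap",
    "p": "pic",
    "r": "refer",
    "t": "tbl",
    "v": "vgrind",
}

# A hint is a roff comment on the first line of the page, e.g.:  '\" te
# (Assumption: a simplified pattern; real pages may vary in spacing.)
HINT_RE = re.compile(r"""^'\\"\s+([a-z]+)\s*$""")


def preprocessor_hints(page_source: str) -> list[str]:
    """Return the preprocessors named on the first line of a page, if any."""
    first_line = page_source.split("\n", 1)[0]
    match = HINT_RE.match(first_line)
    if match is None:
        return []
    # Unknown letters are skipped rather than treated as errors.
    return [PREPROCESSORS[c] for c in match.group(1) if c in PREPROCESSORS]
```

A page whose first line is `'\" te` would yield tbl and eqn, in that
order; a page with no hint line yields an empty list.
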
> > but on the other hand this starts to feel like a much less natural fit
> > for the way nroff is run in every other situation, where you're
> > processing one document at a time.
>
> This I disagree with. Or perhaps more precisely, it's another example
> of the exception (man(1)) swallowing the rule (nroff/troff). nroff and
> troff were written as Unix filters; they read the standard input stream
> (and/or argument list)[1], do some processing, and write to standard
> output.[2]
>
> Historically, troff (or one of its preprocessors) was commonly used with
> multiple input files to catenate them.

But this application is not conceptually like catenation (even if it
might be possible to implement it that way). The collection of all
manual pages on a system is not like one long document that happens to
be split over multiple files, certainly not from an indexer's point of
view.

-- 
Colin Watson (he/him)                              [cjwatson@xxxxxxxxxx]