Re: [PATCH v2 1/3] proc_pid_fdinfo.5: Reduce indent for most of the page

"G. Branden Robinson" <g.branden.robinson@xxxxxxxxx> · Sat, 2 Nov 2024 19:50:23 -0500

Hi Colin,

At 2024-11-02T19:06:53+0000, Colin Watson wrote:
> How embarrassing.  Could somebody please file a bug on
> https://gitlab.com/man-db/man-db/-/issues to remind me to fix that?

Done; <https://gitlab.com/man-db/man-db/-/issues/46>.

> lexgrog(1) is a useful (if oddly-named, sorry) debugging tool, but if
> you focus on that then you'll end up with a design that's not very
> useful.  What really matters is indexing the whole system's manual
> pages, and mandb(8) does not do that by invoking lexgrog(1) one page
> at a time, but rather by running more or less the same code
> in-process.

Ah, I see it now--"lexgrog.l" is in both the Automake macros
"lexgrog_SOURCES" and "mandb_SOURCES".  Nice and DRY!

> I already know that getting acceptable performance for
> this requires care, as illustrated by one of the NEWS entries for
> man-db 2.10.0:
> 
>  * Significantly improve `mandb(8)` and `man -K` performance in the
>    common case where pages are of moderate size and compressed using
>    `zlib`: `mandb -c` goes from 344 seconds to 10 seconds on a test
>    system.
> 
> ... so I'm prepared to bet that forking nroff one page at a time will
> be unacceptably slow.

Probably, but there is little reason to run nroff that way (as of groff
1.23).  It already works well, but I have ideas for further hardening
groff's man(7) and mdoc(7) packages such that they return to a
well-defined state when changing input documents.

> (This also combines with the fact that man-db applies some sandboxing
> when it's calling nroff just in case it might happen that a
> moderately-sized C++ project has less than 100% perfect security when
> doing text processing, which I'm sure everyone agrees would never
> happen.)

Inconceivable, yes!  But fortunately you can run nroff over N documents
and pay its own startup overhead costs as well as those of sandboxing
only once.

> If it were possible to run nroff over a whole batch of pages and get
> output for each of them in one go, then maaaaybe.

That's already true for formatting the entire page.  It's how this was
created.

https://www.gnu.org/software/groff/manual/groff-man-pages.utf8.txt

(...best viewed with "less -R")

With the `-d EXTRACT` feature I have in mind, in its
as-simple-as-possible first-cut form, the problem you anticipate...

> man-db would need a reliable way to associate each line (or sometimes
> multiple lines) of output with each source file,

...would remain.  I'll have to think of a good way to write out
"metadata" (the input file name and the arguments to the `TH` request)
as each page is encountered, and of an interface to enable that.  I
don't see it happening before groff 1.25.

> and of course care would be needed around error handling and so on.

I need to give this thought, too.  What sorts of error scenarios do you
foresee?  GNU troff itself, if it can't open a file to be formatted,
reports an error diagnostic and continues to the next `argv` string
until it reaches the end of input.

> I can see the appeal, in terms of processing the actual language
> rather than a pile of hacks that try to guess what to do with it

...a major selling point, IMO...

> but on the other hand this starts to feel like a much less natural fit
> for the way nroff is run in every other situation, where you're
> processing one document at a time.

This I disagree with.  Or perhaps more precisely, it's another example
of the exception (man(1)) swallowing the rule (nroff/troff).  nroff and
troff were written as Unix filters; they read the standard input stream
(and/or argument list)[1], do some processing, and write to standard
output.[2]

Historically, troff (or one of its preprocessors) was commonly used with
multiple input files to catenate them.

Here's an example of this practice from 1980.

https://minnie.tuhs.org/cgi-bin/utree.pl?file=3BSD/usr/doc/pascal/makefile

Regards,
Branden

[1] ...including this option from Seventh Edition Unix (1979) or
    earlier, which survives in GNU troff to this day.

     -i     Read standard input after the input files are
            exhausted.

[2] Seventh Edition troff didn't write to stdout by default, but tried
    to open the typesetter device.  But it had an option to write to
    standard output.

     -t     Direct output to the standard output instead of the
            phototypesetter.

   Running old school Unix under emulation these days, you _have_ to use
   this option to avoid the dreaded "Typesetter busy." diagnostic.

   When Kernighan refactored troff for device-independence, he
   reseated it more squarely in the Unix filter tradition by writing
   its plain-text page description language to stdout.  The output
   driver, such as "dpost" for PostScript, also read its standard input,
   and could thus become just one more stage in a pipeline.  [CSTR #97]
Attachment:
signature.asc

Description: PGP signature