At 2022-07-22T01:22:57+0100, Colin Watson wrote:
> On Fri, Jul 22, 2022 at 01:16:49AM +0200, Alejandro Colomar wrote:
> > On 7/21/22 20:36, G. Branden Robinson wrote:
> > > At 2022-07-21T16:29:21+0200, Alejandro Colomar wrote:
> > > > Also, does it have any functional implications? I'm especially
> > > > interested in knowing if that may affect in any way the ability
> > > > of man(1) to find a page when invoked as `man TIMESPEC` for
> > > > example.
> > >
> > > My understanding is that mandb(8) indexes based solely on the
> > > second argument to the `TH` macro call and (what it interprets as)
> > > the contents of the "Name" (or "NAME") section of the page. It
> > > parses *roff itself as best it can to determine this. So the fact
> > > that the _first_ argument to `TH` might be in full caps doesn't
> > > deter it. (It might in fact have made mandb(8) authors' job
> > > easier if an "honest lettercase" practice had arisen back in the
> > > day--but it didn't).
[...]
> > > Since he's a mandb(8) author/maintainer, I would again defer to
> > > Colin Watson's knowledge and expertise in this area.
[...]
>
> The above is not quite correct. man-db doesn't index on the .TH
> section at all, and I don't believe I've encountered the practice of
> doing so in other indexers (I could be wrong, but I think that's
> something I would have remembered if I'd noticed it). Rather, it
> parses the "NAME" (or "Name", or a number of localized variants)
> section of pages using the man macro set for "foo \- description"
> lines and uses the left-hand side of those for page names, or
> equivalently looks for .Nm requests in pages using the mdoc macro set.

Ah, thanks, Colin. A quick consultation of ncurses man pages reveals
that mandb(8)'s idea of the manual section comes from its place in the
directory hierarchy, not from parsing the arguments to the `TH` call.
My error!

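The indexing approach Colin describes for man(7) pages can be
illustrated with a rough sketch. This is not man-db's code: the function
name `parse_name_section` and the `NAME_HEADINGS` set are hypothetical,
localized headings and mdoc `.Nm` handling are left out, and escapes
inside the text are not interpreted.

```python
# Headings treated as the "Name" section. Assumption: only these two;
# a real indexer also accepts localized variants.
NAME_HEADINGS = {"NAME", "Name"}


def parse_name_section(source):
    """Return (names, description) from a man(7) page's NAME section.

    Returns None if no "names \\- description" line is found.
    """
    in_name = False
    body = []
    for line in source.splitlines():
        if line.startswith(".SH"):
            if in_name:
                break  # the next section heading ends NAME
            heading = line[3:].strip().strip('"')
            in_name = heading in NAME_HEADINGS
            continue
        # Keep only text lines; skip other requests and macro calls.
        if in_name and not line.startswith((".", "'")):
            body.append(line.strip())
    text = " ".join(body)
    # Names sit to the left of the first "\-", the summary to the right.
    left, sep, right = text.partition("\\-")
    if not sep:
        return None
    names = [n.strip() for n in left.split(",") if n.strip()]
    return names, right.strip()


page = (
    '.TH TIMESPEC 3type 2022-07-22 "Linux man-pages"\n'
    ".SH NAME\n"
    "timespec \\- time in seconds and nanoseconds\n"
    ".SH LIBRARY\n"
    "Standard C library\n"
)
print(parse_name_section(page))
# prints (['timespec'], 'time in seconds and nanoseconds')
```

Note that the full-caps first argument to `TH` plays no part here: the
indexed name comes from the left-hand side of the "\-" line alone.
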
> With the exception of handling localized variants of that section
> name, which is a pretty ugly pile of special cases, I believe this to
> be fairly traditional behaviour. I can't say I would have done it
> that way if I'd been designing the system from scratch since it really
> involves far too much half-arsed parsing, but it seemed to be the
> usual thing to do when I came on the scene.

We could have groff man(7) and mdoc(7) recognize a register, named
`INDEX`, `DB`, or `SUMMARIZE` or something, which would cause the
package(s) to emit the required information, derived solely from page
content, in a desirable format. Say, JSON, maybe.

Upon seeing this register and reporting the data, the package could then
invoke `nx` to move to the next input file.

Thus, potentially, the indexing data could be generated with great
speed--you could call groff (or nroff, it wouldn't matter) with as many
man page file arguments as desired, specifying no preprocessor options
(except maybe those for preconv), and a large percentage of page content
would never even be read, let alone formatted.

Why, I wonder, was the thing not done this way in the first place?
Possibly because what follows "Name" can be arbitrary roff language
input.

However...

The "Name" section's contents can be stored in a diversion. In normal
circumstances, this diversion's contents would be emitted immediately
upon any other `SH` call (or, for degenerate pages that declare no
sections after "Name", when the page's end macro is called[1]).

Once in a diversion, these contents are subject to "sanitization", a
feature I'm chewing over adding to the formatter.[2] The gist is that
all the garbage (font changes, special character escape sequences) you
currently spend time parsing or stripping away is already removed or
transformed for you, leaving clean, printable ASCII or UTF-8.

At this point I pause to let the wave of horror break over my audience.

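For a sense of the cleanup an indexer has to do today, and that
sanitization would take over, here is a rough Python sketch. It is not
groff code and does not describe the proposed feature's behaviour, which
is specified (if anywhere) in [2]. The name `sanitize` and the `NAMES`
table are hypothetical, only a handful of special characters are mapped,
and unknown ones are simply dropped.

```python
import re

# Font selection escapes: \fB, \f(CW, \f[BI]
FONT_ESCAPE = re.compile(r"\\f(?:\[[^\]]*\]|\(..|.)")
# Special characters: \[em], \(em, and the minus sign escape \-
SPECIAL = re.compile(r"\\(?:\[([^\]]+)\]|\((..)|(-))")

# Assumed ASCII stand-ins for a few groff special character names.
NAMES = {"em": "--", "en": "-", "hy": "-", "aq": "'", "dq": '"'}


def sanitize(text):
    """Strip font changes and map a few special characters to ASCII."""
    text = FONT_ESCAPE.sub("", text)

    def replace(match):
        if match.group(3):
            return "-"
        name = match.group(1) or match.group(2)
        return NAMES.get(name, "")  # unknown names are dropped

    return SPECIAL.sub(replace, text)


print(sanitize(r"\fBfoo\fP \- a \(dqsimple\(dq tool"))
# prints foo - a "simple" tool
```
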
Regards,
Branden

[1] andoc.tmac contrives for this to be the case when rendering multiple
    pages.
[2] https://savannah.gnu.org/bugs/?62787