Re: All caps .TH page title

"G. Branden Robinson" <g.branden.robinson@xxxxxxxxx> · Thu, 21 Jul 2022 20:34:35 -0500

At 2022-07-22T01:22:57+0100, Colin Watson wrote:
> On Fri, Jul 22, 2022 at 01:16:49AM +0200, Alejandro Colomar wrote:
> > On 7/21/22 20:36, G. Branden Robinson wrote:
> > > At 2022-07-21T16:29:21+0200, Alejandro Colomar wrote:
> > > > Also, does it have any functional implications?  I'm especially
> > > > interested in knowing if that may affect in any way the ability
> > > > of man(1) to find a page when invoked as `man TIMESPEC` for
> > > > example.
> > > 
> > > My understanding is that mandb(8) indexes based solely on the
> > > second argument to the `TH` macro call and (what it interprets as)
> > > the contents of the "Name" (or "NAME") section of the page.  It
> > > parses *roff itself as best it can to determine this.  So the fact
> > > that the _first_ argument to `TH` might be in full caps doesn't
> > > deter it.  (It might in fact have made mandb(8) authors' job
> > > easier if an "honest lettercase" practice had arisen back in the
> > > day--but it didn't).
[...]
> > > Since he's a mandb(8) author/maintainer, I would again defer to
> > > Colin Watson's knowledge and expertise in this area.
[...]
> 
> The above is not quite correct.  man-db doesn't index on the .TH
> section at all, and I don't believe I've encountered the practice of
> doing so in other indexers (I could be wrong, but I think that's
> something I would have remembered if I'd noticed it).  Rather, it
> parses the "NAME" (or "Name", or a number of localized variants)
> section of pages using the man macro set for "foo \- description"
> lines and uses the left-hand side of those for page names, or
> equivalently looks for .Nm requests in pages using the mdoc macro set.
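
To make the man(7) half of that concrete, here is a minimal Python
sketch of splitting a "foo \- description" line into page names and a
summary.  It is an illustration only, not man-db's actual code (man-db
is written in C and handles many more cases); the function name
`parse_name_line` and the separator regex are my own assumptions.

```python
import re

# Separator between names and description in a man(7) "Name" line:
# "\-" (the usual roff spelling) or a bare "-", set off by whitespace.
# This pattern is an assumption for illustration, not man-db's rule.
_SEP = re.compile(r"\s+\\?-\s+")


def parse_name_line(line):
    """Split 'foo, bar \\- text' into (['foo', 'bar'], 'text').

    Returns None when the line has no recognizable separator.
    """
    parts = _SEP.split(line.strip(), maxsplit=1)
    if len(parts) != 2:
        return None
    # The left-hand side may list several comma-separated page names.
    names = [n.strip() for n in parts[0].split(",") if n.strip()]
    return names, parts[1].strip()
```

Hyphens inside a name (as in "apt-get") are left alone because the
separator must have whitespace on both sides.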

Ah, thanks, Colin.  A quick consultation of ncurses man pages reveals
that mandb(8)'s idea of the manual section comes from its place in the
directory hierarchy, not from parsing the arguments to the `TH` call.
My error!

> With the exception of handling localized variants of that section
> name, which is a pretty ugly pile of special cases, I believe this to
> be fairly traditional behaviour.  I can't say I would have done it
> that way if I'd been designing the system from scratch since it really
> involves far too much half-arsed parsing, but it seemed to be the
> usual thing to do when I came on the scene.

We could have groff man(7) and mdoc(7) recognize a register, named
`INDEX`, `DB`, or `SUMMARIZE` or something, which would cause the
package(s) to emit the required information, derived solely from page
content, in a desirable format.  Say, JSON, maybe.  Upon seeing this
register and reporting the data, the package could then invoke `nx` to
move to the next input file.

Thus, potentially, the indexing data could be generated with great
speed--you could call groff (or nroff, it wouldn't matter) with as many
man page file arguments as desired, specifying no preprocessor options
(except maybe those for preconv), and a large percentage of page content
would never even be read, let alone formatted.

Why, I wonder, was the thing not done this way in the first place?
Possibly because what follows "Name" can be arbitrary roff language
input.  However...

The "Name" section's contents can be stored in a diversion.  In normal
circumstances, this diversion's contents would be emitted immediately
upon any other `SH` call (or, for degenerate pages that declare no
sections after "Name", when the page's end macro is called[1]).

Once in a diversion, these contents are subject to "sanitization", a
feature I'm chewing over adding to the formatter.[2]  The gist is that
all the garbage (font changes, special character escape sequences) you
currently spend time parsing or stripping away is already removed or
transformed for you, leaving clean, printable ASCII or UTF-8.
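
For a sense of the work an indexer has to do by hand today, here is a
rough Python sketch that strips font escapes and maps a few special
character escapes to plain text.  It is neither man-db's code nor the
proposed formatter feature; the helper name `strip_escapes` and the
deliberately tiny escape table are assumptions for illustration.

```python
import re

# Font selection escapes: \fB, \fP, \f(BI, \f[CB], \f[]
_FONT = re.compile(r"\\f(?:\[[^\]]*\]|\(..|.)")

# A deliberately small, incomplete table of escapes (illustrative only).
_SPECIAL = {
    r"\-": "-",    # minus sign
    r"\&": "",     # non-printing input break
    r"\(aq": "'",  # neutral apostrophe
    r"\(dq": '"',  # neutral double quote
}
_SPECIAL_RE = re.compile("|".join(re.escape(k) for k in _SPECIAL))


def strip_escapes(text):
    """Remove font changes and translate a few special characters."""
    text = _FONT.sub("", text)
    return _SPECIAL_RE.sub(lambda m: _SPECIAL[m.group(0)], text)
```

Any real page will use escapes this table does not know about, which is
rather the point: the formatter already understands all of them.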

At this point I pause to let the wave of horror break over my audience.

Regards,
Branden

[1] andoc.tmac contrives for this to be the case when rendering multiple
    pages.
[2] https://savannah.gnu.org/bugs/?62787