groff features for hyperlinked man pages (was: No 6.05/.01 pdf book available)

[looping in groff@gnu because I'm talking about GNU troff internals and
development plans I have]

At 2023-08-14T22:22:16+0100, Deri wrote:
> On Monday, 14 August 2023 21:01:46 BST Alejandro Colomar wrote:
> > On 2023-08-14 19:37, Alejandro Colomar wrote:
> > >> Another change which would need to be accepted is to allow a
> > >> fourth parameter to .MR which is the destination name.  Normally
> > >> the name of the destination is derived from the first two
> > >> parameters concatenated with "_", but if the name part of the .MR
> > >> call to the man page includes non-ASCII characters (such as ".MR
> > >> my\-lovely\-page 7 ,") then it needs to provide a "clean"
> > >> destination name.

(linux-man readers might want to check out here, except for one
clarification about `MR` below.  The presentation below is mostly about
macro package and formatter internals.)

Yes.  This sort of thing has been undertaken a couple of times in macro
packages for groff.  Keith Marshall's pdfmark and your "an:cln" in the
"deri-gropdf-ng" branch are two examples.  I also had a couple of cracks
at it in my new feature in groff 1.23 that abbreviates pieces of man
page headers/footers when they would overrun other parts.[1]

These things are painful and fragile to implement.  We need a Better
Way(tm).  So I've proposed one and have roughed most of it out in my Git
stash.  I've mentioned it before on the groff list.  What we need is a
*roff string iterator.[2]  I think it's likely that implementing one
will also help us with other problems.[3][4]

Because strings, macros, and diversions can be interchanged, or at least
punned,[5] I think we'll also need a new conditional expression operator
to identify iterands that aren't ordinary or special characters.  The
only application of this that I envision, at least in early days, is so
that a macro iterating over them could skip them, ignore them, or throw
an error.  Already we have problems when people apply the cruder tools
GNU troff has for string manipulation to things that aren't
representable as individual characters within a string.[6]

With a general string/macro/diversion iterator, we can retire a lot of
the existing GNU troff requests that deal with strings, replacing them
with macros using the iterator request, which I intend to call `for`.  A
reverse iterator, `rfor`, may also be necessary.

Using the iterator request, in a file called "string.tmac" maybe,
`length`, `chop`, `substring`, `stringup`, and `stringdown` could all
become macros.  It would be easy to add `index` and `rindex` (after the
old BSD string library functions, now superseded by strchr() and
strrchr()), which macro programmers have requested repeatedly over the
years.
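
A rough model of the idea, written in Python because the `for` request
does not exist yet.  The names `iterate`, `index`, and `rindex`, and the
three escape forms handled, are assumptions for illustration and not
GNU troff's design.  The sketch only shows that once a string can be
walked one iterand at a time, searching it reduces to a short loop.

```python
# Hypothetical model of a *roff string iterator; not GNU troff code.
# Only three escape forms are recognized: \c, \(xx, and \[name].
# Real *roff has many escapes that are not characters at all; this
# sketch treats every escape as a special character for simplicity.

def iterate(s):
    """Yield (kind, text) pairs; kind is 'ordinary' or 'special'."""
    i = 0
    while i < len(s):
        if s[i] != "\\":
            yield ("ordinary", s[i])
            i += 1
        elif s.startswith("\\(", i):
            # Two-character special character name, e.g. \(mi
            yield ("special", s[i:i + 4])
            i += 4
        elif s.startswith("\\[", i):
            # Arbitrary-length name, e.g. \[mi]; str.index raises
            # ValueError if the closing bracket is missing.
            end = s.index("]", i)
            yield ("special", s[i:end + 1])
            i = end + 1
        else:
            # One-character escape such as \-
            yield ("special", s[i:i + 2])
            i += 2


def index(s, wanted):
    """Return the position of the first iterand equal to wanted, or -1."""
    for pos, (_, text) in enumerate(iterate(s)):
        if text == wanted:
            return pos
    return -1


def rindex(s, wanted):
    """Return the position of the last iterand equal to wanted, or -1."""
    found = -1
    for pos, (_, text) in enumerate(iterate(s)):
        if text == wanted:
            found = pos
    return found
```

Positions count iterands, not bytes, so in "my\-lovely\-page" the first
`\-` is at position 2 and the last at position 9.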

Often, just searching through a string for a character is all a macro
wants to do.  It's always been absurdly baroque to do that in GNU troff.
In AT&T troff, it was nigh impossible.  (If I don't say "nigh", Tadziu
Hoffman will appear in a puff of brimstone, arrayed in mask and cloak,
to show us some obfuscated but brilliant hack to get it done.  Okay,
forget I said "nigh".  ;-) )

> > I just re-read this, and am confused.  '\-' is an ASCII character,
> > isn't it?  In fact, all of the Linux man-pages pathnames are
> > composed exclusively of ASCII characters, aren't they?

You're thinking about this at the wrong level, Alex.  `\-` is a *roff
special character.  Unless converted to something else by character
translation or character definition,[7] it goes to the
device-independent page description language as a special character too.

Here's a quick glimpse of device-independent troff output, albeit
without helpful italics.

    c c  Typeset the glyph of the ordinary character c.  The drawing
         position is not advanced.

    C id⟨whitespace⟩
         Typeset the glyph of the special character id.  Trailing
         syntactical space is necessary to allow special character names
         of arbitrary length.  The drawing position is not advanced.

Consider a three-character *roff input document.

printf -- '-\\-\n' | groff -Tascii -Z

The part where the glyphs are written out looks like this.[8]

c-
C\-

It is up to the output device to decide what to do with that.  groff's
"ascii" and "latin1" output devices put out a U+002D character; its
"utf8" device puts out a minus sign, U+2212.  Now, before anyone
defecates a brick about the U+2212 not being easily greppable, nor
useful for copying and pasting to a shell prompt, the man(7) and mdoc(7)
macro packages override that.

  In man pages (only), groff maps the minus sign special character '\-'
  to the Basic Latin hyphen-minus (U+002D) because man pages require
  this glyph and there is no historically established *roff input
  character, ordinary or special, for obtaining it when a hyphen and
  minus sign are both separately available.  To obtain a true minus
  sign, use the special character escape sequences '\(mi' or '\[mi]'.[9]
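
The hand-off from formatter to output device can be modeled in a few
lines of Python.  This is an illustrative sketch only: the function
names and the tiny mapping table are assumptions, cover nothing but the
`\-` case described above, and are not how groff's output drivers are
implemented.

```python
# Illustrative model only.  The table covers just the '\-' special
# character and is not groff's real font description data.

MINUS_BY_DEVICE = {
    "ascii": "\u002d",   # hyphen-minus
    "latin1": "\u002d",  # hyphen-minus
    "utf8": "\u2212",    # minus sign
}


def parse_glyph_commands(lines):
    """Turn 'c' and 'C' commands into (kind, name) pairs."""
    glyphs = []
    for line in lines:
        if line.startswith("c") and len(line) >= 2:
            # 'c' is followed immediately by one ordinary character.
            glyphs.append(("ordinary", line[1]))
        elif line.startswith("C"):
            # 'C' is followed by a name that runs to the next whitespace.
            fields = line[1:].split()
            if fields:
                glyphs.append(("special", fields[0]))
    return glyphs


def render_minus(device, man_page=False):
    """Return what the modeled device writes for the '\\-' glyph."""
    # man(7) and mdoc(7) override the device's choice with U+002D.
    if man_page:
        return "\u002d"
    return MINUS_BY_DEVICE[device]
```

Feeding it the two lines "c-" and "C\-" yields one ordinary and one
special glyph; only the latter depends on the device and on whether a
man page is being formatted.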

> ... If you look at the keyrings(7) man page you see examples such as:-
> 
> .BR persistent\-keyring (7) ,
[...]
> Which when converted to .MR calls looks like:-
[...]
> .MR "persistent\-keyring" "7" "," "persistent-keyring"

Urp.  No, it doesn't.  Not unless you changed `MR` in deri-gropdf-ng.

.BR persistent\-keyring (7) ,

when converted to an `MR` call, looks like this.

.MR persistent\-keyring 7 ,

I expect man page authors would violently protest if they were told they
had to type all those quotes and, worse, repeat the name of the page.

One of the selling points of `MR` is less typing (no parentheses).  It
is hard enough to sell that macro on the linux-man list without
inaccurate claims entering the fray.

Now, if I understand correctly, it is quite possible that something
you're doing in your branch is having `MR` call another macro internally
to prepare a hyperlink with some "anchor"--I won't say "node" because
that collides with GNU troff internal jargon--information.  (This is
suggested by the heavy quoting you showed, since when macros call each
other with arbitrary numbers of arguments, and those arguments need to
be kept separate in the callee, the caller should use the `\$@` escape
sequence, which is analogous to the POSIX shell's `$@`.)

In your example above we see the "\-" special character getting
converted to an ordinary one, "-".  That's totally fine and is the sort
of thing I want to make easy--fundamentally with the `for` string
iterator request and practically with an `index` (or `strchr`, or
`bikeshed`) macro.
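
As a sketch of that conversion, here is a Python model of deriving a
"clean" destination name from the first two `MR` arguments.  The helper
name `destination_name` and its one-entry translation table are
hypothetical, and the exact form used in the deri-gropdf-ng branch may
differ; the "_" separator follows the description quoted earlier.

```python
# Hypothetical helper; 'destination_name' and its translation table are
# assumptions for illustration, not part of groff or gropdf.

ORDINARY_EQUIVALENT = {"\\-": "-"}


def destination_name(name, section):
    """Join page name and section with '_', using ordinary characters."""
    clean = []
    i = 0
    while i < len(name):
        if name[i] == "\\":
            escape = name[i:i + 2]
            if escape not in ORDINARY_EQUIVALENT:
                # Anything we cannot translate is an error, not a guess.
                raise ValueError("no ordinary equivalent for " + repr(escape))
            clean.append(ORDINARY_EQUIVALENT[escape])
            i += 2
        else:
            clean.append(name[i])
            i += 1
    return "".join(clean) + "_" + section
```

With this, the page author still types only `.MR persistent\-keyring 7 ,`
and the macro package, not the author, is responsible for the clean
name.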

> On the keyrings(7) page in the pdf you should be able to see the
> difference between HYPHEN (U+2010), which is what \- becomes, and
> HYPHEN-MINUS (U+002D) which is the ascii character. The problem is
> that the MR request is a bit naughty in that it uses the first two
> parameters for two purposes, first it is used as a destination, but it
> is also output as text.

I don't think it's naughty; I think that by and large, man page authors
don't care to give "anchor names" to elements of their document.  They
want the macro package to figure it out.  I think one reason--maybe the
only reason--people are getting a glimpse inside the sausage factory of
GNU troff internals is because we haven't had a defined mechanism for
getting character data to an output device that is neither (1) intended
for formatting (writing visible glyphs) nor (2) in the printable ASCII
(Unicode Basic Latin) character set.  That's the aforementioned Savannah
#63074.[3]

Looking farther ahead, I think a further step is required if we're going
to have intra-page links; we're going to have to have a way to
disambiguate duplicates.  In practice there's not much risk from having
duplicate section titles in man pages, but I reckon a big, complex page
could duplicate subsection titles.  And if we automatically generate
hyperlink tags for paragraph tags, those would likely need it as well.
Maybe representing such internal anchors hierarchically will be enough:
"section_subsection_tag" or something like that.  I'm confident this
problem has been robustly solved elsewhere, and that all we need to do
is identify, adopt, and adapt a known good solution.
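
One possible shape for this, sketched in Python.  The function
`make_anchors` is hypothetical; joining with "_" follows the
"section_subsection_tag" idea above, and the numeric suffix for any
duplicates that remain is an added assumption, not an established
solution.

```python
# Hypothetical scheme: hierarchical names joined with "_", plus a
# numeric suffix (an assumption) for duplicates that remain.

def make_anchors(entries):
    """Map (section, subsection, tag) tuples to unique anchor names."""
    seen = {}
    anchors = []
    for parts in entries:
        # Empty levels (e.g. no subsection) are simply left out.
        base = "_".join(p for p in parts if p)
        count = seen.get(base, 0) + 1
        seen[base] = count
        anchors.append(base if count == 1 else "%s_%d" % (base, count))
    return anchors
```

This simple version does not guard against a generated suffix clashing
with a real title that already ends in "_2"; a robust scheme would need
to handle that case.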

> So as text it may contain escapes to enhance the typography, for
> example using \- for a better looking hyphen. It is not my job to
> impose artificial restrictions on how a man page author wants his
> creation to look, better to separate the two purposes, so there is no
> restriction.

Agreed.

> > > Is this really needed?  Can't gropdf just translate them
> > > internally?  Say, do unconditionally the equivalent of `| tr - _
> > > |` or something like that.
> > > 
> > > [...]
> 
> This is all happening in groff macros way before it gets to gropdf.

I hope I have shed some light on this process.

At 2023-08-14T22:40:31+0100, Deri wrote:
> I'm really hoping Branden's going to help me with that, I think he
> intimated that he might when he suggested I start a branch for the
> work. I have one more push to the branch to do, but I need to contact
> Peter since there is a minor tweak to om.tmac to make expandos work in
> mom.

Yes, I plan to help.  Some of the things you are having to do are ugly,
and the existing GNU troff limitations described above really hurt for
a Do The Right Thing kind of programmer.  I want to fix these
problems so that the macro stuff we do in man(7) is as simple and easy
to understand as possible.  Complexity creates hiding places for bugs.

Regards,
Branden

[1] "[man] abbreviate titles when too wide, instead of overlapping them"
    https://savannah.gnu.org/bugs/?43532

[2] "[troff] string iteration handles escape sequences inconsistently
    (want `for` request)"
    https://savannah.gnu.org/bugs/?62264

[3] "[troff] need a way to embed non-Basic Latin glyphs in device
    control commands"
    https://savannah.gnu.org/bugs/?63074

[4] "[troff] standard error output should be sanitized"
    https://savannah.gnu.org/bugs/?62787

[5] https://www.gnu.org/software/groff/manual/groff.html.node/Punning-Names.html#Punning-Names

[6] "[troff] .chop cannot surmount the barrier of a .char definition"
    https://savannah.gnu.org/bugs/?64439

[7] https://www.gnu.org/software/groff/manual/groff.html.node/Using-Symbols.html

[8] It actually doesn't.  In GNU troff, the GNU extension command 't' is
    used for sequences of non-overstruck ordinary characters when
    supported by output drivers, and in AT&T device-independent troff,
    the unnamed move-and-print command--a performance and storage
    optimization tuned to the needs of machines in 1980 and explained in
    CSTR #97--was used.  But 'c' is simple, supported by all device-
    independent troffs, and works.

[9] https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?h=1.23.0#n112

