[looping in groff@gnu because I'm talking about GNU troff internals and
development plans I have]

At 2023-08-14T22:22:16+0100, Deri wrote:
> On Monday, 14 August 2023 21:01:46 BST Alejandro Colomar wrote:
> > On 2023-08-14 19:37, Alejandro Colomar wrote:
> > >> Another change which would need to be accepted is to allow a
> > >> fourth parameter to .MR which is the destination name. Normally
> > >> the name of the destination is derived from the first two
> > >> parameters concatenated with "_", but if the name part of the .MR
> > >> call to the man page includes non-ascii characters (such as ".MR
> > >> my\-lovely\-page 7 ,") then it needs to provide a "clean"
> > >> destination name.

(linux-man readers might want to check out here, except for one
clarification about `MR` below. The presentation below is mostly about
macro package and formatter internals.)

Yes. This sort of thing has been undertaken a couple of times in macro
packages for groff. Keith Marshall's pdfmark and your "an:cln" in the
"deri-gropdf-ng" branch are two examples. I also had a couple of cracks
at it in my new feature in 1.23 for abbreviating pieces of man page
headers/footers if they would overrun other parts of it.[1]

These things are painful and fragile to implement. We need a Better
Way(tm).

So I've proposed one and have roughed most of it out in my Git stash.
I've mentioned it before on the groff list. What we need is a *roff
string iterator.[2] I think it's likely that implementing one will also
help us with other problems.[3][4]

Because strings, macros, and diversions can be interchanged, or at least
punned,[5] I think we'll also need a new conditional expression operator
to identify iterands that aren't ordinary or special characters. The
only application of this that I envision, at least in early days, is so
that a macro iterating over them could skip them, ignore them, or throw
an error.

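A rough model of the iterator idea, written in Python only because the
`for` request does not exist in GNU troff yet. Everything here is an
assumption made for illustration: the names `roff_iter`, `characters`,
and `index_of` are hypothetical, and the tokenizer knows only three
special character forms (`\-`, `\(xx`, `\[name]`). Any other escape is
reported as a non-character iterand without parsing its arguments.

```python
import re
from typing import Iterator, List, Tuple

Item = Tuple[str, str]  # (kind, text); kind is "ordinary", "special", or "other"

# Hypothetical model only: real *roff escape parsing is far richer than this.
_TOKEN = re.compile(
    r"\\\[(?P<named>[^\]]+)\]"  # \[name]  special character
    r"|\\\((?P<two>..)"         # \(xx     special character
    r"|(?P<minus>\\-)"          # \-       special character
    r"|(?P<other>\\.)"          # any other escape: not a character at all
    r"|(?P<ordinary>.)",        # everything else is an ordinary character
    re.DOTALL,
)


def roff_iter(s: str) -> Iterator[Item]:
    """Yield one (kind, text) pair per iterand of the string s."""
    for m in _TOKEN.finditer(s):
        kind = m.lastgroup          # name of the alternative that matched
        text = m.group(kind)
        if kind in ("named", "two", "minus"):
            yield ("special", text)
        else:
            yield (kind, text)


def characters(s: str, strict: bool = False) -> List[Item]:
    """Collect only ordinary and special characters.

    Non-character iterands are skipped, or raise ValueError when strict.
    """
    out = []
    for kind, text in roff_iter(s):
        if kind == "other":
            if strict:
                raise ValueError(f"not a character: {text!r}")
            continue
        out.append((kind, text))
    return out


def index_of(s: str, wanted: Item) -> int:
    """Return the 0-based position of wanted among the characters of s, or -1."""
    chars = characters(s)
    return chars.index(wanted) if wanted in chars else -1
```

With this model, `index_of(r"my\-lovely\-page", ("special", "\\-"))`
finds the first `\-` without the caller having to know how many input
bytes each iterand occupies, which is the point of iterating over
characters rather than over raw input.
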
Already we have problems when people apply the cruder tools GNU troff
has for string manipulation to things that aren't representable as
individual characters within a string.[6]

With a general string/macro/diversion iterator, we can retire a lot of
the existing GNU troff requests that deal with strings, replacing them
with macros using the iterator request, which I intend to call `for`. A
reverse iterator, `rfor`, may also be necessary.

Using the iterator request, in a file called "string.tmac" maybe,
`length`, `chop`, `substring`, `stringup`, and `stringdown` could all
become macros. It would be easy to add `index` and `rindex` (after the
old BSD string library functions, now superseded by strchr() and
strrchr()), which macro programmers have requested repeatedly over the
years.

Often, just searching through a string for a character is all a macro
wants to do. It's always been absurdly baroque to do that in GNU troff.
In AT&T troff, it was nigh impossible. (If I don't say "nigh", Tadziu
Hoffman will appear in a puff of brimstone, arrayed in mask and cloak,
to show us some obfuscated but brilliant hack to get it done. Okay,
forget I said "nigh". ;-) )

> > I just re-read this, and am confused. '\-' is an ASCII character,
> > isn't it? In fact, all of the Linux man-pages pathnames are
> > composed exclusively of ASCII characters, aren't they?

You're thinking about this at the wrong level, Alex. `\-` is a *roff
special character. Unless converted to something else by character
translation or character definition,[7] it goes to the
device-independent page description language as a special character
too.

Here's a quick glimpse of device-independent troff output, albeit
without helpful italics.

    c c
        Typeset the glyph of the ordinary character c. The drawing
        position is not advanced.

    C id⟨whitespace⟩
        Typeset the glyph of the special character id. Trailing
        syntactical space is necessary to allow special character
        names of arbitrary length. The drawing position is not
        advanced.

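As a toy model of those two commands, and nothing more, the sketch
below maps input text to a list of `c` and `C` commands. It is not what
GNU troff does internally: the function name `glyph_commands` is
hypothetical, only `\-`, `\(xx`, and `\[name]` are treated as special
characters, every other escape is dropped, and whitespace is skipped
because the real formatter turns it into motions rather than glyphs.

```python
import re
from typing import List

# Toy model, not GNU troff.  Each alternative is tried in order.
_TOKEN = re.compile(
    r"\\\[([^\]]+)\]"  # group 1: name from \[name]
    r"|\\\((..)"       # group 2: name from \(xx
    r"|(\\-)"          # group 3: the \- special character
    r"|\\."            # any other escape: no group, so it is dropped
    r"|(.)",           # group 4: ordinary character
    re.DOTALL,
)


def glyph_commands(text: str) -> List[str]:
    """Return one 'c' or 'C' command per glyph in text."""
    commands = []
    for named, two, minus, ordinary in _TOKEN.findall(text):
        special = named or two or minus
        if special:
            # Special characters keep their name; the newline that ends
            # each list item plays the role of the trailing whitespace.
            commands.append("C" + special)
        elif ordinary and not ordinary.isspace():
            commands.append("c" + ordinary)
    return commands


print("\n".join(glyph_commands(r"a\(mib")))
```

In this model the input `a\(mib` yields `ca`, `Cmi`, and `cb`, one
command per line: the ordinary characters stay ordinary and the special
character travels under its name.
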
Consider a three-character *roff input document.

    printf -- '-\\-\n' | groff -Tascii -Z

The part where the glyphs are written out looks like this.[8]

    c-
    C\-

It is up to the output device to decide what to do with that. groff's
"ascii" and "latin1" output devices put out a U+002D character; its
"utf8" device puts out a minus sign, U+2212.

Now, before anyone defecates a brick about the U+2212 not being easily
greppable, nor useful for copying and pasting to a shell prompt, the
man(7) and mdoc(7) macro packages override that.

    In man pages (only), groff maps the minus sign special character
    '\-' to the Basic Latin hyphen-minus (U+002D) because man pages
    require this glyph and there is no historically established *roff
    input character, ordinary or special, for obtaining it when a
    hyphen and minus sign are both separately available. To obtain a
    true minus sign, use the special character escape sequences '\(mi'
    or '\[mi]'.[9]

> ... If you look at the keyrings(7) man page you see examples such as:-
>
> .BR persistent\-keyring (7) ,
[...]
> Which when converted to .MR calls looks like:-
[...]
> .MR "persistent\-keyring" "7" "," "persistent-keyring"

Urp. No, it doesn't. Not unless you changed `MR` in deri-gropdf-ng.

    .BR persistent\-keyring (7) ,

when converted to an `MR` call, looks like this.

    .MR persistent\-keyring 7 ,

I expect man page authors would violently protest if they were told they
had to type all those quotes and, worse, repeat the name of the page.
One of the selling points of `MR` is less typing (no parentheses). It is
hard enough to sell that macro on the linux-man list without inaccurate
claims entering the fray.

Now, if I understand correctly, it is quite possible that something
you're doing in your branch is having `MR` call another macro internally
to prepare a hyperlink with some "anchor"--I won't say "node" because
that collides with GNU troff internal jargon--information.
(This is suggested by the heavy quoting you showed, since when macros
call each other with arbitrary numbers of arguments, and those arguments
need to be kept separate in the callee, the caller should use the `\$@`
escape sequence, which is analogous to the POSIX shell's `$@`.)

In your example above we see the "\-" special character getting
converted to an ordinary one, "-". That's totally fine and is the sort
of thing I want to make easy--fundamentally with the `for` string
iterator request and practically with an `index` (or `strchr`, or
`bikeshed`) macro.

> On the keyrings(7) page in the pdf you should be able to see the
> difference between HYPHEN (U+2010), which is what \- becomes, and
> HYPHEN-MINUS (U+002D) which is the ascii character. The problem is
> that the MR request is a bit naughty in that it uses the first two
> parameters for two purposes, first it is used as a destination, but it
> is also output as text.

I don't think it's naughty; I think that by and large, man page authors
don't care to give "anchor names" to elements of their document. They
want the macro package to figure it out.

I think one reason--maybe the only reason--people are getting a glimpse
inside the sausage factory of GNU troff internals is because we haven't
had a defined mechanism for getting character data to an output device
that is neither (1) intended for formatting (writing visible glyphs) nor
(2) in the printable ASCII (Unicode Basic Latin) character set. That's
the aforementioned Savannah #63074.[3]

Looking farther ahead, I think a further step is required if we're going
to have intra-page links; we're going to have to have a way to
disambiguate duplicates. In practice there's not much risk from having
duplicate section titles in man pages, but I reckon a big, complex page
could duplicate subsection titles. And if we automatically generate
hyperlink tags for paragraph tags, those would likely need it as well.
Maybe representing such internal anchors hierarchically will be enough:
"section_subsection_tag" or something like that. I'm confident this
problem has been robustly solved elsewhere, and that all we need to do
is identify, adopt, and adapt a known good solution.

> So as text it may contain escapes to enhance the typography, for
> example using \- for a better looking hyphen. It is not my job to
> impose artificial restrictions on how a man page author wants his
> creation to look, better to separate the two purposes, so there is no
> restriction.

Agreed.

> > > Is this really needed? Can't gropdf just translate them
> > > internally? Say, do unconditionally the equivalent of `| tr - _
> > > |` or something like that.
> > >
> > > [...]
> > This is all happening in groff macros way before it gets to gropdf.

I hope I have shed some light on this process.

At 2023-08-14T22:40:31+0100, Deri wrote:
> I'm really hoping Branden's going to help me with that, I think he
> intimated that he might when he suggested I start a branch for the
> work. I have one more push to the branch to do, but I need to contact
> Peter since there is a minor tweak to om.tmac to make expandos work in
> mom.

Yes, I plan to help. Some of the things you are having to do are ugly,
and they make the existing GNU troff limitations described above really
hurt for a Do The Right Thing kind of programmer. I want to fix these
problems so that the macro stuff we do in man(7) is as simple and easy
to understand as possible. Complexity creates hiding places for bugs.

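Returning to the destination names discussed above, here is a minimal
sketch of deriving them automatically so that authors never type one.
It assumes the rule Deri described (first two `MR` arguments joined with
"_") and the hierarchical "section_subsection_tag" idea. The function
names and the cleaning policy (turn `\-` into "-", drop other escapes,
replace anything outside a small Basic Latin set with "_") are
hypothetical choices for illustration, not what the deri-gropdf-ng
branch does.

```python
import re


def clean_name(text: str) -> str:
    """Reduce *roff input text to a plain Basic Latin anchor component."""
    # The \- special character becomes an ordinary hyphen-minus.
    text = text.replace("\\-", "-")
    # Assumption: any remaining escape carries no naming information.
    text = re.sub(r"\\\[[^\]]*\]|\\\(..|\\.", "", text)
    # Anything outside this conservative set becomes an underscore.
    return re.sub(r"[^A-Za-z0-9._-]", "_", text)


def page_destination(name: str, section: str) -> str:
    """Destination for a man page reference: cleaned name, "_", section."""
    return clean_name(name) + "_" + clean_name(section)


def internal_anchor(*parts: str) -> str:
    """Hierarchical intra-page anchor, outermost component first."""
    return "_".join(clean_name(part) for part in parts)


print(page_destination(r"persistent\-keyring", "7"))
print(internal_anchor("Options", "Output control", r"\-Z"))
```

One caveat with this naive join: a component that already contains "_"
makes the hierarchy ambiguous, so a real scheme would need a separator
that cannot occur inside a cleaned component.
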
Regards,
Branden

[1] "[man] abbreviate titles when too wide, instead of overlapping them"
    https://savannah.gnu.org/bugs/?43532

[2] "[troff] string iteration handles escape sequences inconsistently
    (want `for` request)"
    https://savannah.gnu.org/bugs/?62264

[3] "[troff] need a way to embed non-Basic Latin glyphs in device
    control commands"
    https://savannah.gnu.org/bugs/?63074

[4] "[troff] standard error output should be sanitized"
    https://savannah.gnu.org/bugs/?62787

[5] https://www.gnu.org/software/groff/manual/groff.html.node/Punning-Names.html#Punning-Names

[6] "[troff] .chop cannot surmount the barrier of a .char definition"
    https://savannah.gnu.org/bugs/?64439

[7] https://www.gnu.org/software/groff/manual/groff.html.node/Using-Symbols.html

[8] It actually doesn't. In GNU troff, the GNU extension command 't' is
    used for sequences of non-overstruck ordinary characters when
    supported by output drivers, and in AT&T device-independent troff,
    the unnamed move-and-print command--a performance and storage
    optimization tuned to the needs of machines in 1980 and explained
    in CSTR #97--was used. But 'c' is simple, supported by all
    device-independent troffs, and works.

[9] https://git.savannah.gnu.org/cgit/groff.git/tree/PROBLEMS?h=1.23.0#n112