Re: Bogus index in man-pages book from other projects

"G. Branden Robinson" <g.branden.robinson@xxxxxxxxx> · Tue, 12 Mar 2024 10:15:18 -0500

[looping in groff list]

Hi Alex,

At 2024-03-12T15:12:52+0100, Alejandro Colomar wrote:
> Hmm, interesting thing to try!  I've tried it too, and the bookmarks
> for the in-page sections (e.g., DESCRIPTION, or rather ОПИСАНИЕ)
> appear with no name (or maybe it's a locale problem in my system?).
> See attached PDF.

That's a known problem with groff 1.23.0 and earlier.  It went less
remarked-upon than it should have because it turns out there is a way to
sneak character codes with the eighth bit set as-is out of the formatter
into device-independent output.  (I just learned about this mechanism
this past week.)  And since a lot of groff users were happy with the ISO
8859-1 character repertoire in their documents, they were fine with it.

It's gone unresolved longer than it should have because fixing it is
challenging.  If you catch up on groff mailing list traffic for January
and February you will see Deri and me discussing it.

I have a solution that I think will work,[1] but it keeps growing in
scope.  The thing I learned last week is that the `\!` escape sequence
can be used to smuggle character codes 128-255 decimal into grout.
(See attachment.)[2]  Addressing this requires surgery on a part of the
formatter that tends to be used only by relative experts, and there are
no unit tests of this escape sequence to assuage my fear of breaking
things, so I'll have to write them.
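
As an alternative to reading a hex dump by eye, the "x X" lines of
device-independent output can be scanned for bytes with the eighth bit
set.  The Python below is only a minimal sketch of that idea; the
function name and the sample bytes are invented for illustration and
are not captured troff output.

```python
def eight_bit_device_commands(grout: bytes):
    """Return (line_number, offending_bytes) for each "x X" line
    that contains at least one byte with the eighth bit set."""
    hits = []
    for number, line in enumerate(grout.splitlines(), start=1):
        # Device control commands start with "x X" in trout/grout.
        if line.startswith(b"x X"):
            high = bytes(b for b in line if b >= 0x80)
            if high:
                hits.append((number, high))
    return hits


if __name__ == "__main__":
    # Hand-written sample (an assumption), with Latin-1 "a-umlaut" as 0xE4.
    sample = b"x T ps\nx X The Stupendous Y\xe4ppi\nx stop\n"
    for number, high in eight_bit_device_commands(sample):
        print(f"line {number}: 8-bit bytes {high.hex()}")
```

Reading the data as bytes rather than text avoids guessing at an
encoding, which is the very thing the formatter cannot know.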

Regards,
Branden

[1] The main element of it is to have the `device` request and the `\X`
    escape sequence (the latter being an AT&T troff feature) read their
    parameters in copy mode.

    https://savannah.gnu.org/bugs/?64484

[2] Interestingly, what GNU troff does here is compatible with DWB 3.3
    troff but not Heirloom Doctools troff; but Heirloom otherwise has no
    problem emitting UTF-8 sequences, for instance as arguments to trout
    'C' commands.

    My plan is to have GNU troff reject code points 128 <= n <= 255 in
    arguments to the `device` and `output` requests (both GNU
    extensions) and in `\!` and `\X` escape sequence parameters.  We
    don't know what character encoding an output device requires, so my
    proposal is to require input documents (including macro packages) to
    express such code points as groff Unicode special character escape
    sequences (that is, in the form \[u123AB]).  An alternative would be
    to have the output device report what encoding it requires in its
    DESC file, and give GNU troff the responsibility of converting to
    that encoding when writing output.  But to me that seems like an
    inferior solution, loading up the formatter with more character
    set-conversion functionality when it's increasingly a UTF-8 world
    anyway.  The likely persistent exception is the UTF-16-oriented PDF
    device.  Fortunately, in groff 1.23.0, Deri added support to
    gropdf(1) for interpretation of such escape sequences in device
    "specials" (device control commands; "x X" commands in trout/grout).
    I'm attaching another couple of examples to illustrate this.

    Also, if we make the formatter strict about 7-bit-clean input in
    groff 1.24, that will clear the decks for moving from an assumption
    of Latin-1 input today to UTF-8 input in 1.25.
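
The requirement proposed in [2], that such code points be written as
groff Unicode special character escape sequences, can be approximated
outside the formatter.  The Python below is a minimal sketch, assuming
the input has already been decoded to text; the function name is
hypothetical, and it does no Unicode normalization.  preconv(1) is the
real tool for this job, and the sketch does not claim to reproduce its
behavior.

```python
def to_groff_escapes(text: str) -> str:
    """Replace each non-ASCII character with a \\[uXXXX] escape,
    leaving 7-bit characters untouched."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        else:
            # At least four uppercase hex digits; code points above
            # U+FFFF naturally take five or six.
            out.append(f"\\[u{cp:04X}]")
    return "".join(out)


if __name__ == "__main__":
    print(to_groff_escapes('.Section "\\%A na\u00efve attempt at bookmarking"'))
```

Applied to the `.Section` line of the UTF-8 example, this yields the
same line that the 7-bit-clean example uses.
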
.\" troff | hd # or your choice of hex dumper
Hello, world.
.sp
\!x X The Stupendous Yäppi will now read your mind!
.sp
Bye.

.\" groff -Kutf8 -Tpdf
.nr index 0 1
.de Section
.  sp 1i
.  ft B
.  pdfbookmark 1 "\\$*"
.  ds mark!\\n+[index] \\*[PDFBOOKMARK.NAME]
.  nop \\$*
.  ft
.  sp
..
.Section "\%A naïve attempt at bookmarking"
Sed ut perspiciatis, unde omnis iste natus error sit voluptatem
accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab
illo inventore veritatis et quasi architecto beatae vitae dicta sunt,
explicabo.  Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur
aut odit aut fugit, sed quia consequuntur magni dolores eos, qui ratione
voluptatem sequi nesciunt, neque porro quisquam est, qui dolorem ipsum,
quia dolor sit amet consectetur adipisci velit, sed quia non-numquam eius
modi tempora incidunt, ut labore et dolore magnam aliquam quaerat
voluptatem.
.bp
.Section "Another section"
Return to
.pdfhref L -D \*[mark!1] -- the first section
or
.pdfhref L -A . -D \*[mark!2] -- the last one

.\" groff -Tpdf
.\" needs groff 1.23.0 or later
.nr index 0 1
.de Section
.  sp 1i
.  ft B
.  pdfbookmark 1 "\\$*"
.  ds mark!\\n+[index] \\*[PDFBOOKMARK.NAME]
.  nop \\$*
.  ft
.  sp
..
.Section "\%A na\[u00EF]ve attempt at bookmarking"
Sed ut perspiciatis, unde omnis iste natus error sit voluptatem
accusantium doloremque laudantium, totam rem aperiam eaque ipsa, quae ab
illo inventore veritatis et quasi architecto beatae vitae dicta sunt,
explicabo.  Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur
aut odit aut fugit, sed quia consequuntur magni dolores eos, qui ratione
voluptatem sequi nesciunt, neque porro quisquam est, qui dolorem ipsum,
quia dolor sit amet consectetur adipisci velit, sed quia non-numquam eius
modi tempora incidunt, ut labore et dolore magnam aliquam quaerat
voluptatem.
.bp
.Section "Another section"
Return to
.pdfhref L -D \*[mark!1] -- the first section
or
.pdfhref L -A . -D \*[mark!2] -- the last one