On Sun, Nov 19, 2023 at 04:21:45PM +0000, Deri wrote:

> > $ touch man2/membarrier.2
> > $ make build-pdf
> > PRECONV .tmp/man/man2/membarrier.2.tbl
> > TBL     .tmp/man/man2/membarrier.2.eqn
> > EQN     .tmp/man/man2/membarrier.2.pdf.troff
> > TROFF   .tmp/man/man2/membarrier.2.pdf.set
> > GROPDF  .tmp/man/man2/membarrier.2.pdf
> >
> > That helps debug the pipeline, and also learn about it.
> >
> > If that helps parallelize some tasks, then that'll be welcome.
>
> Hi Alex,

Hi Deri,

> Doing it that way actually stops the jobs being run in parallel!  Each step

Hmm, kind of makes sense.

> completes before the next step starts, whereas if you let groff build
> the pipeline all the processes are run in parallel.  Using separate
> steps may be desirable for "understanding every little step of the
> groff pipeline" (and

Still a useful thing for our build system.

> may aid debugging an issue), but once such knowledge is obtained it is
> probably better to leave the pipelining to groff in a production
> environment.

Unless performance is really a problem, I prefer the understanding and
debugging aid.  It'll help not only me, but also others who see the
project and would like to learn how all this magic works.

> > > The time saved would be absolutely minimal.  It is obvious that to
> > > produce a pdf containing all the man pages, all the man pages have
> > > to be consumed by groff, not just the page which has changed.
> >
> > But do you need to run the entire pipeline, or can you reuse most of
> > it?  I can process in parallel much faster, with `make -jN ...`.  I
> > guess the .pdf.troff files can be reused; maybe even the .pdf.set
> > ones?
> >
> > Could you change the script at least to produce intermediary files
> > as in the pipeline shown above?  As many as possible would be
> > excellent.
>
> Perhaps it would help if I explain the stages of my script.  First, a
> look at what the script needs to do to produce a pdf of all man pages.
> There are too many files to produce a single command line with the
> filenames of every man page, and groff has no mechanism for passing a
> list of filenames, so the first job is

You can always `find ... | xargs cat | troff /dev/stdin`

> to concatenate all the separate files into one input file for groff.
> And while we are doing that, add the "magic sauce" which makes all the
> pdf links in the book and sorts out the aliases which point to another
> man page.

Yep, I think I partially understood that part of the script today.
It's what this `... | LC_ALL=C grep '^\\. *ds' |` pipeline produces and
passes to groff, right?

> After this is done there is a single troff file, called LMB.man, which
> is the

That's what's currently called LinuxManBook.Z, right?

> file groff is going to process.  In the script you should see
> something like this:-
>
>     my $temp='LMB.man';

I don't.  Maybe you have a slightly different version of it?

>     [...]
>
>     my $format='pdf';
>     my $paper=$fpaper || '';
>     my $cmdstring="-T$format -k -pet -M. -F. -mandoc -manmark -dpaper=$paper -P-p$paper -rC1 -rCHECKSTYLE=3";
>     my $front='LMBfront.t';
>     my $frontdit='LMBfront.set';
>     my $mandit='LinuxManBook.set';
>     my $book="LinuxManBook.$format";
>
>     system("groff -T$format -dpaper=$paper -P-p$paper -ms $front -Z > $frontdit");

This creates the front-page .set file.

>     system("groff -z -dPDF.EXPORT=1 -dLABEL.REFS=1 $temp $cmdstring 2>&1 |
>             LC_ALL=C grep '^\\. *ds' |

This creates the bookmarks, right?

>             groff -T$format $cmdstring - $temp -Z > $mandit");

And this is the main .set file.

>     system("./gro$format -F.:/usr/share/groff/current/font $frontdit $mandit -p$paper > $book");

And finally we have the book.

> (This includes changes by Brian Inglis.)  If you remove the lines
> which call system, you will end up with just the single file LMB.man
> (in about a quarter of a second).  You can treat this file just the
> same as your single-page example if you want to.
>
> The first system call creates the title page from the troff source
> file LMBfront.t and produces LMBfront.set; this can be added to your
> makefile as an entirely separate rule, depending on whether the .set
> file needs to be built.
>
> The second and third system calls are the calls to groff, which could
> be put into your makefile or split into separate stages to avoid
> parallelism.
>
> The second system call produces LinuxManBook.set, and the third
> combines this with LMBfront.set to produce the pdf.
>
> The "./" in the third system call is because I gave you a pre-release
> gropdf; you may be using the released 1.23.0 gropdf now.
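To make sure I understood the stages: here's how I'd translate those
three system calls into the one-rule-per-file make rules I want.  This
is only an untested sketch: it hardcodes a4 paper (your script takes
the paper size as an argument), assumes the released gropdf(1) in
$PATH, copies the flags verbatim from $cmdstring, and leaves LMB.man to
be produced by your script (the quarter-second part):

    GROFF_PDF_FLAGS = -Tpdf -k -pet -M. -F. -mandoc -manmark \
                      -dpaper=a4 -P-pa4 -rC1 -rCHECKSTYLE=3

    # Front page: troff source -> device-independent output.
    LMBfront.set: LMBfront.t
            groff -Tpdf -dpaper=a4 -P-pa4 -ms $< -Z > $@

    # Two passes: the first (-z) emits the .ds lines (bookmarks and
    # links), which are fed back into the second, ahead of the book.
    LinuxManBook.set: LMB.man
            groff -z -dPDF.EXPORT=1 -dLABEL.REFS=1 $< $(GROFF_PDF_FLAGS) 2>&1 \
            | LC_ALL=C grep '^\. *ds' \
            | groff $(GROFF_PDF_FLAGS) - $< -Z > $@

    # Combine the front page and the book into the final PDF.
    LinuxManBook.pdf: LMBfront.set LinuxManBook.set
            gropdf -F.:/usr/share/groff/current/font $^ -pa4 > $@

(Recipes need real tabs, of course.)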
> > > On my system this takes about 18 seconds to produce the 2800+
> > > pages of the book.  Of this, a quarter of a second is consumed by
> > > the "magic" part of the script; the rest of the 18 seconds is
> > > consumed by calls to groff and gropdf.
> >
> > But how much of that work needs to be on a single process?  I bought
> > a new CPU with 24 cores.  Gotta use them all :D
>
> I realise you are having difficulty in letting go of your idea of
> re-using previous work, rather than starting afresh each time.
> Imagine a single word change in one man page causes it to grow from 2
> pages to 3; all links to pages after this changed entry would then be
> one page adrift.  This is why very little previous work is useful, and
> why the whole book has to be dealt with as a single process.

Does such a change need re-running troff(1)?  Or is gropdf(1) enough?

My problem is probably that I don't know what's done by `gropdf`, and
what's done by `troff -Tpdf`.  I was hoping that `troff -Tpdf` still
didn't need to know about the entire book, and that only gropdf(1)
would need that.
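My current guess is that troff(1) does the typesetting and pagination,
while gropdf(1) only translates the device-independent output into PDF.
A rough sketch on the single-page example from the top of this mail
(untested; flags borrowed from $cmdstring):

    $ groff -Tpdf -k -pet -mandoc -Z man2/membarrier.2 > membarrier.set
    $ gropdf membarrier.set > membarrier.pdf

If that's right, then whenever a change reflows pages the .set file
itself changes, so the troff(1) step has to be re-run; re-running only
gropdf(1) wouldn't be enough.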
> If each entry was processed separately, as you would like in order to
> use all your shiny new cores, how would the process dealing with
> accept(2) know which page socket(2) would be on when it adds it as a
> link in the text?  I hope you can see that at some point it has to be
> treated as a homogenous whole in order to calculate correct links
> between entries.
>
> > > So any splitting of the perl script is only going to have an
> > > effect on the quarter of a second!
> > >
> > > I don't understand why the perl script can't be included in your
> > > make file as part of the build-pdf target.
> >
> > It can.  I just prefer to be strict about the Makefile having "one
> > rule per each file", while currently the script generates 4 files
> > (T, two .Z's, and the .pdf).
>
> Explained above how to separate them, so that the script only
> generates LMB.man and the system calls are moved to the makefile.

Thanks!

> > > Presumably it would be dependent on running after the scripts
> > > which add the revision label and date to each man page.
> >
> > I only set the revision and date on dist tarballs.  For the git HEAD
> > book, I'd keep the (unreleased) version and (date).  So, no worries
> > there.
>
> Given that you seem to intend to offer these interim books as a
> download, it would make sense if they included either a date or a git
> commit ID to differentiate them; if someone queries something, it
> would be useful to know exactly what they were looking at.

The books for releases are available at
<https://www.alejandro-colomar.es/share/dist/man-pages/6/6.05/6.05.01/man-pages-6.05.01.pdf>
(replace the version numbers for other versions, or navigate the dirs).
I need to document that in the README of the project.

For git HEAD, I plan to have something like
<https://www.alejandro-colomar.es/share/dist/man-pages/git/man-pages-HEAD.pdf>

It's mainly intended for easily checking what git HEAD looks like, and
discarding it later.  If the audience asks for version numbers, though,
I could provide `git describe` versions and dates in the pages.

> Cheers
>
> Deri
>
> > > > Since I don't understand Perl, and don't know much of gropdf(1)
> > > > either, I need help.
> > > >
> > > > Maybe Deri or Branden can help with that.  If anyone else
> > > > understands it and can also help, that's very welcome too!
> > >
> > > You are probably better placed to add the necessaries to your
> > > makefile.  You would then just need to remember to make build-pdf
> > > any time you alter one of the source man pages.  Since you are
> > > manually running my script to produce the pdf, it should not be
> > > difficult to automate it in a makefile.
> > >
> > > > Then I could install a hook in my server that runs
> > > >
> > > >     $ make build-pdf docdir=/srv/www/...
> > >
> > > And wait 18s each time the hook is actioned!!  Or, set the build
> > > to place the generated pdf somewhere in /srv/www/... and include
> > > the build in your normal workflow when a man page is changed.
> >
> > Hmm.  I still hope some of it can be parallelized, but 18s could be
> > reasonable, if the server does that in the background after pushing.
> > My old raspberry pi would burn, but the new computer should handle
> > that just fine.
>
> I'm confused.  The 18s is how long it takes to generate the book, so
> if the book is built in response to an access to a particular url, the
> http server can't start "pushing" for the 18s; then add on the
> transfer time for the pdf, and I suspect you will have a lot of
> aborted transfers.  Additionally, the script, and any makefile
> equivalent you write, is not designed for concurrent invocation, so if
> two people visit the same url within the 18-second window, neither
> user will receive a valid pdf.

No, my intention is that whenever I `git push` via SSH, the receiving
server runs `make build-book-pdf` after receiving the changes.  That is
run after the git SSH connection has closed, so I wouldn't notice.
HTTP connections wouldn't trigger anything in my server, except Nginx
serving the file, of course.

> I advise that the build becomes part of your workflow after making
> changes, and then place the pdf in a location where it can be served
> by the http server.
>
> Your model of slicing and dicing man pages to be processed
> individually is doable using a website to serve the individual pages,
> see:-
>
> http://chuzzlewit.co.uk/WebManPDF.pl/man:/2/accept
>
> This is running on a 1" cube no more powerful than a raspberry pi 3.
> The difference is that the "magic sauce" added to each man page sets
> the links to external http calls back to itself to produce another man
> page, rather than internal links to another part of the pdf.  You can
> get an index of all the man pages, on the (very old) system, here:
>
> http://chuzzlewit.co.uk/

Yep, I've seen that server :)

Long term, I also intend to provide one-page PDFs and HTML files of the
pages.  Although I prefer pre-generating them, instead of on-demand.
Maybe a git hook, or maybe a cron job that re-generates them once a day
or so.
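Something like this rough sketch is what I have in mind for the
pre-generation (untested; the output directory is made up, and the
flags are borrowed from $cmdstring):

    mkdir -p tmp/pdfpages
    for page in man?/*.[1-9]*; do
            groff -Tpdf -k -pet -mandoc "$page" \
                    > "tmp/pdfpages/$(basename "$page").pdf"
    done

Written as one makefile rule per page, that's also the kind of job that
`make -j24` can finally spread over all the cores, since without the
book-wide links each page is independent of the others.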
Cheers,
Alex

> Cheers
>
> Deri

-- 
<https://www.alejandro-colomar.es/>