Hi Johannes,

Johannes Schindelin wrote:

>> For checking links, a tool like linkchecker[1] is very handy.
>> This is run against the local docs in the Fedora package
>> builds to catch broken links.
>
> Hmm, `linkchecker` is really slow for me, even locally.

Yeah, it took an hour and a half to run for me, both on an old laptop
and on a fast server with plenty of threads, bandwidth, and memory.

Checking the local git HTML documentation, which is largely the only
place I've used it, takes under 30 seconds.  It has been very helpful
in catching broken links in the docs during the build, and the time is
short enough that I never minded.

> Granted, the added cross-references now increase the number of hyperlinks
> to check, but after I let the program run for a bit over an hour to look
> at https://git-scm.com/ (for comparison), it is now running on the local
> build (i.e. the `public/` folder generated by Hugo, not even an HTTP
> server) for over 45 minutes and still not done:
>
> -- snip --
> [...]
> 10 threads active, 112977 links queued, 206443 links in 100001 URLs checked, runtime 48 minutes, 46 seconds
> 10 threads active, 113455 links queued, 206689 links in 100001 URLs checked, runtime 48 minutes, 52 seconds
> 10 threads active, 113829 links queued, 206874 links in 100001 URLs checked, runtime 48 minutes, 57 seconds
> 10 threads active, 114230 links queued, 207136 links in 100001 URLs checked, runtime 49 minutes, 3 seconds
> 10 threads active, 114731 links queued, 207498 links in 100001 URLs checked, runtime 49 minutes, 9 seconds
> -- snap --

I would have thought that bumping up the number of threads a lot would
really help, but I ran it on a dual Xeon system with 40 threads and it
took about the same time.  Perhaps I should have increased the thread
count to double the system's processor count, or more.

> Maybe something is going utterly wrong because the number
> of links seems to be dramatically larger than what the
> https://git-scm.com/ reported; Maybe linkchecker broke out
> of the `public/` directory and now indexes my entire
> harddrive ;-)

Heh, hopefully not. :)

I wondered if there were circular links that it was picking up and not
de-duplicating.  I may try to run it with the --verbose option, which
logs all checked URLs.  Maybe that will turn up something.  It sure
seems like there are a _lot_ of links here.

There is a --recursion-level option which might be helpful.  The
--ignore-url and/or --no-follow-url options may also be useful; I've
sketched a possible invocation at the end of this mail.

Though even if it's (very) slow, it might be worth running to flush out
some initial issues before making the site live.  Letting it run in the
background for a few hours is probably less effort than fielding a
number of bug reports about broken URLs here and there. :)

Of course, it would be even better if it were fast enough to run as
part of the site build process to catch broken links before each
deployment, but that would need to be measured in some relatively small
number of seconds instead of the hours it seems to take now. :/

--
Todd
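
P.S. Something like this is roughly what I had in mind; it's untested
against the Hugo output, and the two regex patterns are only
placeholders for whatever the site would actually want to skip or not
descend into:

    linkchecker --threads=40 --recursion-level=5 \
        --ignore-url='pattern-to-skip-entirely' \
        --no-follow-url='pattern-to-check-but-not-recurse-into' \
        public/index.html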