Re: [PATCH] fetch: use "quick" has_sha1_file for tag following

On Fri, Oct 14, 2016 at 10:39:52AM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
> 
> > So it's certainly better. But 7500 packs is just silly, and squeezing
> > out ~400ms there is hardly worth it. If you repack this same case into a
> > single pack, the command drops to 5ms. So yes, there's close to an order
> > of magnitude speedup here, but you get that _and_ another order of
> > magnitude just by repacking.
> 
> "7500 is silly" equally applies to the "quick" (and sloppy, if I am
> reading your "Failing in this direction doesn't make me feel great."
> correctly) approach, I think, which argues for not taking either
> change X-<.

I wouldn't quite agree that they're the same. 7500 packs is silly
because a bunch of _other_ things will get equally slow or slower before
the prepare_packed_git() slowdown becomes noticeable. And it's within
the user's power to clean up, which we should encourage.

Whereas having a bunch of unfetched tags is a state that may linger
indefinitely, and be outside the user's control.

I _am_ open to the argument that calling reprepare_packed_git() over and
over doesn't really matter much if you have a sane number of packs. If
you tweak my perf test like so:

diff --git a/t/perf/p5550-fetch-tags.sh b/t/perf/p5550-fetch-tags.sh
index a5dc39f..7e7ae24 100755
--- a/t/perf/p5550-fetch-tags.sh
+++ b/t/perf/p5550-fetch-tags.sh
@@ -86,7 +86,7 @@ test_expect_success 'create child packs' '
 		cd child &&
 		git config gc.auto 0 &&
 		git config gc.autopacklimit 0 &&
-		create_packs 500
+		create_packs 10
 	)
 '
 

you get:

Test            origin            quick                 
--------------------------------------------------------
5550.4: fetch   0.06(0.02+0.02)   0.02(0.01+0.00) -66.7%

Still an impressive speedup as a percentage, but negligible in absolute
terms. But that's on a local filesystem on a Linux machine. I'd worry
much more about a system with a slow readdir(), e.g., due to NFS.
Somebody's real-world NFS case[1] was what prompted us to do 0eeb077
(index-pack: avoid excessive re-reading of pack directory, 2015-06-09).

It looks like I _did_ look into optimizing this into a single stat()
call in the thread at [1]. I completely forgot about that. I did find
there that naively using stat_validity() on a directory is racy, though
I wonder if we could do something clever with gettimeofday() instead.
IOW, the patches there only bothered to re-read when they saw the mtime
on the directory jump, which suffers from 1-second precision problems.
But if we instead compared the mtime to the current time, we could err
in favor of re-reading the packs, and get false positives for at most 1
second.

[1] http://public-inbox.org/git/7FAE15F0A93C0144AD8B5FBD584E1C5519758FC3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
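
A rough sketch of that idea (totally untested, and the names are made
up, not actual git internals) would be to remember the mtime we saw at
the last scan and then do a single stat() before trusting the cached
pack list:

#include <sys/stat.h>
#include <time.h>

/*
 * Return 1 if the cached pack list should be considered stale and the
 * pack directory re-read, 0 if it is probably still good.
 *
 * "prev_mtime" is the directory mtime we saw at our last scan. Since
 * mtimes have only 1-second precision, an mtime within the last second
 * of "now" is also treated as stale; that errs in favor of re-reading,
 * with false positives for at most one second.
 */
static int pack_dir_maybe_stale(const char *pack_dir, time_t prev_mtime)
{
        struct stat st;

        if (stat(pack_dir, &st) < 0)
                return 1;       /* be conservative on errors */
        if (st.st_mtime != prev_mtime)
                return 1;       /* the mtime jumped; something changed */
        if (time(NULL) - st.st_mtime <= 1)
                return 1;       /* same second as "now"; can't be sure */
        return 0;
}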

> I agree that the fallout from the inaccuracy of "quick" approach is
> probably acceptable and the next "fetch" will correct it anyway, so
> let's do the "quick but inaccurate" for now and perhaps cook it in
> 'next' for a bit longer than other topics?

I doubt that cooking in 'next' for longer will turn up anything useful.
The case we care about is the race between a repack and a fetch. We
lived with the "quick" version of has_sha1_file() everywhere for 8
years. I only noticed the race on a hosting cluster which puts through
millions of operations per day (I could in theory test the patch on that
same cluster, but we actually don't do very many fetches).
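
To spell out the tradeoff one more time, the difference boils down to
something like this (heavily simplified; these are not the real
function names, just stand-ins for the pack and loose-object lookups):

/* hypothetical helpers standing in for the real lookups */
extern int find_in_cached_packs(const unsigned char *sha1);
extern int has_loose_object(const unsigned char *sha1);
extern void rescan_pack_directory(void);

static int object_exists(const unsigned char *sha1, int quick)
{
        if (find_in_cached_packs(sha1) || has_loose_object(sha1))
                return 1;
        if (quick)
                return 0;       /* false negative if a repack just ran */
        rescan_pack_directory();        /* the expensive readdir() */
        return find_in_cached_packs(sha1);
}

The false negative can only happen when a repack finishes in the window
between our initial pack scan and the lookup, which is the race
described above.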

-Peff


