On 15.06.2017 at 07:56, Jeff King wrote:
> One interesting thing is that the cost of finding short hashes very
> much depends on your loose object setup. I timed:
>
>   git log --format=%H >/dev/null
>
> versus
>
>   git log --format=%h >/dev/null
>
> on git.git. It went from about 400ms to about 800ms. But then I
> noticed I had a lot of loose object directories, and ran
> "git gc --prune=now". Afterwards, my timings were more like 380ms
> and 460ms.
>
> The difference is that in the "before" case, we actually opened each
> directory and ran getdents(). But after gc, the directories are gone
> totally and open() fails. We also have to do a linear walk through
> the objects in each directory, since the contents are sorted.
Do you mean "unsorted"?
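Just to make sure I'm reading the cost right: the short-hash check
basically amounts to a readdir() walk over the two-hex-digit fan-out
directory plus a name-by-name prefix comparison. A standalone sketch of
that walk (not the actual find_short_object_filename() code; the helper
name and the hard-coded path are made up for illustration):

/*
 * Standalone illustration of the scan described above -- not git's
 * find_short_object_filename(); names and paths are made up.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Count loose objects under objdir whose hex name starts with hex_prefix. */
static int count_loose_matches(const char *objdir, const char *hex_prefix)
{
	char path[4096];
	DIR *dir;
	struct dirent *de;
	size_t restlen;
	int matches = 0;

	if (strlen(hex_prefix) < 2)
		return -1;
	restlen = strlen(hex_prefix) - 2;

	/* The fan-out directory is named after the first two hex digits. */
	snprintf(path, sizeof(path), "%s/%.2s", objdir, hex_prefix);
	dir = opendir(path);
	if (!dir)
		return 0;	/* after gc, the directory may be gone entirely */

	/* Linear walk: every entry is compared against the remaining digits. */
	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		if (!strncmp(de->d_name, hex_prefix + 2, restlen))
			matches++;
	}
	closedir(dir);
	return matches;
}

int main(void)
{
	/* Hypothetical prefix, just for illustration. */
	printf("%d\n", count_loose_matches(".git/objects", "deadbeef"));
	return 0;
}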
> So I wonder if it is worth trying to optimize the short-sha1
> computation in the first place. Double-%h aside, that would make
> _everything_ faster, including --oneline.
Right.
> I'm not really sure how, though, short of caching the directory
> contents. That opens up questions of whether and when to invalidate
> the cache. If the cache were _just_ about short hashes, it might be
> OK to just assume that it remains valid through the length of the
> program (so worst case, a simultaneous write might mean that we
> generate a sha1 which just became ambiguous, but that's generally
> going to be racy anyway).
>
> The other downside of course is that we'd spend RAM on it. We could
> bound the size of the cache, I suppose.
You mean like an in-memory pack index for loose objects, in order to
avoid the readdir() call in sha1_name.c::find_short_object_filename()?
We'd only need to keep the hashes of the objects we find; an oid_array
would be quite compact.

Writes done by the process itself are not racy, and they must not be
ignored (write first, read back later) -- checkout, for example, does
something like that. Can we trust object directory time stamps for
cache invalidation?

A rough standalone sketch of the caching idea is appended below.

René
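To make the proposal a bit more concrete: read one fan-out directory at
most once, keep its entries in a sorted in-memory array, and answer
later prefix queries from that array instead of re-reading the
directory. This is not meant as the in-tree implementation (there it
would presumably sit on an oid_array, as suggested above, and hook into
the existing loose-object code per alternate object directory); all
names below are made up, and invalidation is left out entirely.

/*
 * Standalone sketch of a per-directory cache for short-hash lookups.
 * Hypothetical names; no invalidation, no locking, single fan-out dir.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct dir_cache {
	char (*names)[64];	/* hex digits after the first two, per object */
	size_t nr, alloc;
	int loaded;
};

static int cmp_name(const void *a, const void *b)
{
	return strcmp(a, b);
}

static void load_dir(struct dir_cache *c, const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *de;

	c->loaded = 1;		/* remember the attempt even if the dir is missing */
	if (!dir)
		return;
	while ((de = readdir(dir)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		if (c->nr == c->alloc) {
			c->alloc = c->alloc ? 2 * c->alloc : 16;
			c->names = realloc(c->names, c->alloc * sizeof(*c->names));
			if (!c->names)
				exit(1);	/* keep the sketch simple */
		}
		snprintf(c->names[c->nr++], sizeof(*c->names), "%s", de->d_name);
	}
	closedir(dir);
	qsort(c->names, c->nr, sizeof(*c->names), cmp_name);
}

/* Count cached entries whose name starts with "rest". */
static size_t count_matches(struct dir_cache *c, const char *path, const char *rest)
{
	size_t lo = 0, hi, n = 0;

	if (!c->loaded)
		load_dir(c, path);	/* readdir() happens at most once */
	hi = c->nr;

	/* Binary search for the first entry >= rest... */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (strcmp(c->names[mid], rest) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* ...then count the consecutive entries sharing the prefix. */
	while (lo < c->nr && !strncmp(c->names[lo], rest, strlen(rest))) {
		n++;
		lo++;
	}
	return n;
}

int main(void)
{
	static struct dir_cache cache;	/* e.g. for .git/objects/de */

	printf("%zu\n", count_matches(&cache, ".git/objects/de", "adbeef"));
	printf("%zu\n", count_matches(&cache, ".git/objects/de", "adbeef"));
	return 0;
}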