Re: Compressing packed-refs

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 16 Jul 2020 15:27:15 -0700

Konstantin Ryabitsev <konstantin@xxxxxxxxxxxxxxxxxxx> writes:

> I know repos with too many refs is a corner-case for most people, but 
> it's looming large in my world, so I'm wondering if it makes sense to 
> compress the packed-refs file when "git pack-refs" is performed?

I think reftable is the longer-term direction, but let's see if
there is an easy enough optimization opportunity whose development
and maintenance cost we can afford for the short term.

My .git/packed-refs file begins like so:

    # pack-refs with: peeled fully-peeled sorted 
    c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0 refs/heads/maint-2.10
    3b9e3c2cede15057af3ff8076c45ad5f33829436 refs/heads/maint-2.11
    584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master
    2cccc8116438182c988c7f26d9559a1c22e78f1c refs/heads/next
    8300349bc1f0a0e2623d5824266bd72c1f4b5f24 refs/notes/commits
    ...

A few observations that can lead to easy design elements:

 - Typically more than half of each record is consumed by the
   object name, which is hard to "compress".

 - The file is sorted, so the refnames could use prefix compression
   like we do in the v4 index files.
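
To put a rough number on the first observation, here is a small
Python sketch that measures how much of the sample above is taken up
by hex object names. The helper name object_name_share is made up for
illustration, and skipping the "^" peeled lines is a simplification
(they hold object names too).

```python
# The five sample records quoted above (SHA-1, 40 hex digits each).
SAMPLE = """\
# pack-refs with: peeled fully-peeled sorted
c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0 refs/heads/maint-2.10
3b9e3c2cede15057af3ff8076c45ad5f33829436 refs/heads/maint-2.11
584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master
2cccc8116438182c988c7f26d9559a1c22e78f1c refs/heads/next
8300349bc1f0a0e2623d5824266bd72c1f4b5f24 refs/notes/commits
"""


def object_name_share(text):
    """Return the fraction of record bytes used by hex object names."""
    oid_bytes = 0
    total_bytes = 0
    for line in text.splitlines():
        # Skip the header and (as a simplification) peeled-tag lines.
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        oid, _refname = line.split(" ", 1)
        oid_bytes += len(oid)
        total_bytes += len(line) + 1  # +1 for the trailing newline
    return oid_bytes / total_bytes


share = object_name_share(SAMPLE)
print(f"{share:.0%} of the sample records is object names")
```

Longer refnames (deep namespaces, long branch names) would lower the
share, so a real repository may look different from this sample.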

So perhaps a new format could be

 - The header "# pack-refs with: " lists a new trait, "compressed";

 - Object names will be expressed in binary, saving 20 bytes per
   record (with SHA-1, 40 hex digits become 20 raw bytes);

 - Prefix compression of the refnames similar to v4 index would save
   a bit more.
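
A minimal Python sketch of what such records might look like, just to
show the idea. Everything about the layout here is an assumption for
illustration: a raw 20-byte object name, then the number of bytes to
strip from the end of the previous refname, then the new suffix
terminated by NUL. The varint below is a plain 7-bits-per-byte
encoding, not the exact one the v4 index uses, and pack/unpack are
made-up names.

```python
OID_LEN = 20  # raw SHA-1 length in bytes

RECORDS = [  # sorted by refname, as in the sample packed-refs file
    ("c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0", "refs/heads/maint-2.10"),
    ("3b9e3c2cede15057af3ff8076c45ad5f33829436", "refs/heads/maint-2.11"),
    ("584f8975d2d9530a34bd0b936ae774f82fe30fed", "refs/heads/master"),
    ("2cccc8116438182c988c7f26d9559a1c22e78f1c", "refs/heads/next"),
    ("8300349bc1f0a0e2623d5824266bd72c1f4b5f24", "refs/notes/commits"),
]


def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint at buf[pos:]; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack(records):
    """Encode sorted (hex_oid, refname) pairs with prefix compression."""
    out = bytearray()
    prev = b""
    for hex_oid, refname in records:
        name = refname.encode()
        common = 0
        limit = min(len(prev), len(name))
        while common < limit and prev[common] == name[common]:
            common += 1
        out += bytes.fromhex(hex_oid)             # binary object name
        out += encode_varint(len(prev) - common)  # bytes to strip from prev
        out += name[common:] + b"\0"              # new suffix, NUL-terminated
        prev = name
    return bytes(out)


def unpack(buf):
    """Inverse of pack(); returns a list of (hex_oid, refname) pairs."""
    records = []
    prev = b""
    pos = 0
    while pos < len(buf):
        oid = buf[pos:pos + OID_LEN]
        pos += OID_LEN
        strip, pos = decode_varint(buf, pos)
        end = buf.index(b"\0", pos)
        name = prev[:len(prev) - strip] + buf[pos:end]
        pos = end + 1
        records.append((oid.hex(), name.decode()))
        prev = name
    return records


packed = pack(RECORDS)
text_size = sum(len(oid) + 1 + len(name) + 1 for oid, name in RECORDS)
print(f"text: {text_size} bytes, packed: {len(packed)} bytes")
```

Note that a layout like this one can only be read sequentially from
the start (each refname depends on the previous one), so lookups by
binary search would need something extra, such as periodic restart
points with full refnames.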

Storing binary object names would actually be favourable for
performance, as the in-core data structure we use to store the
result of parsing the file uses binary.
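
As an illustration only (in Python, not git's actual C code), the
difference on the reading side would be roughly the following. The
two function names and the binary record layout are hypothetical.

```python
OID_LEN = 20  # raw SHA-1 length in bytes


def oid_from_text_record(line):
    """Text format: 40 hex digits must be converted to raw bytes."""
    return bytes.fromhex(line[:2 * OID_LEN])


def oid_from_binary_record(buf, pos=0):
    """Hypothetical binary format: the raw bytes are used as they are."""
    return buf[pos:pos + OID_LEN]


line = "584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master"
raw_record = (
    bytes.fromhex("584f8975d2d9530a34bd0b936ae774f82fe30fed")
    + b"refs/heads/master\0"
)

# Both paths end up with the same 20 raw bytes; only the text path
# pays for a hex-to-binary conversion on every record.
assert oid_from_text_record(line) == oid_from_binary_record(raw_record)
```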