Re: Compressing packed-refs

On Thu, Jul 16, 2020 at 03:27:15PM -0700, Junio C Hamano wrote:
> I think the reftable is the longer term direction, but let's see if
> there is easy enough optimization opportunity that we can afford the
> development and maintenance cost for the short term.
> 
> My .git/packed-refs file begins like so:
> 
>     # pack-refs with: peeled fully-peeled sorted 
>     c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0 refs/heads/maint-2.10
>     3b9e3c2cede15057af3ff8076c45ad5f33829436 refs/heads/maint-2.11
>     584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master
>     2cccc8116438182c988c7f26d9559a1c22e78f1c refs/heads/next
>     8300349bc1f0a0e2623d5824266bd72c1f4b5f24 refs/notes/commits
>     ...

Let me offer a more special-case (but not crazy) example from 
git.kernel.org. The newer version of grokmirror that I'm working on is 
built to take advantage of the pack-islands feature that was added a 
while back. We fetch all linux forks into a single "object storage" 
repo, with each fork going into its own 
refs/virtual/[uniquename]/(heads|tags) namespace. This means there are 
lots of duplicates in packed-refs, since all the tags from 
torvalds/linux.git end up duplicated in almost every fork.
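
To make that setup concrete, here is a minimal sketch of how a single 
fork could be wired into the shared repo with a per-fork refspec. This 
is not grokmirror's actual code; the remote name, URL, and paths below 
are all made up for illustration:

$ cd /srv/objstore/linux.git
# Map every ref from the fork into its own refs/virtual/ namespace
# (hypothetical fork name and URL):
$ git remote add fork-00018460b026 https://example.org/fork.git
$ git config remote.fork-00018460b026.fetch \
      '+refs/*:refs/virtual/00018460b026/*'
$ git fetch fork-00018460b026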

So, after running git pack-refs --all, the packed-refs file is 50-ish MB 
in size, full of repeated entries like:

5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c refs/virtual/00018460b026/tags/v2.6.11
^c39ae07f393806ccf406ef966e9a15afc43cc36a
...
5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c refs/virtual/00bcef8138af/tags/v2.6.11
^c39ae07f393806ccf406ef966e9a15afc43cc36a

and so on, duplicated roughly 600 times, once per fork. The file 
compresses decently well with gzip -9, and *amazingly* well with xz -9:

$ ls -ahl packed-refs
-rw-r--r--. 1 mirror mirror 46M Jul 16 22:37 packed-refs
$ ls -ahl packed-refs.gz
-rw-r--r--. 1 mirror mirror 19M Jul 16 22:47 packed-refs.gz
$ ls -ahl packed-refs.xz
-rw-r--r--. 1 mirror mirror 2.3M Jul 16 22:47 packed-refs.xz
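
For the record, there is nothing exotic behind those numbers; the 
compressed copies were produced with something like the following 
(-k keeps the original file around):

$ gzip -9 -k packed-refs
$ xz -9 -k packed-refs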

That really just shows how much duplicated data is in that file. If 
reftable is eventually going to replace packed-refs entirely, then we 
probably shouldn't expend too much effort super-optimizing the current 
format, especially if I'm one of the very few people who would benefit 
from it. However, I'm curious whether a different sorting strategy 
would help remove most of the duplication without requiring too much 
engineering time.
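
As a rough way to gauge what a dedup-friendly ordering could buy, one 
can normalize away the per-fork path component and count unique lines. 
This is just a back-of-the-envelope sketch; the sed pattern assumes 
the refs/virtual/[uniquename]/ layout described above:

$ wc -l < packed-refs
$ sed -E 's|refs/virtual/[^/]+/|refs/virtual/*/|' packed-refs \
      | sort -u | wc -l

If the second number is a small fraction of the first, most of the 
file is per-fork duplication that a smarter layout could share.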

-K


