On Thu, Jul 16, 2020 at 03:27:15PM -0700, Junio C Hamano wrote:
> I think the reftable is the longer term direction, but let's see if
> there is easy enough optimization opportunity that we can afford the
> development and maintenance cost for the short term.
>
> My .git/packed-refs file begins like so:
>
> # pack-refs with: peeled fully-peeled sorted
> c3808ca6982b0ad7ee9b87eca9b50b9a24ec08b0 refs/heads/maint-2.10
> 3b9e3c2cede15057af3ff8076c45ad5f33829436 refs/heads/maint-2.11
> 584f8975d2d9530a34bd0b936ae774f82fe30fed refs/heads/master
> 2cccc8116438182c988c7f26d9559a1c22e78f1c refs/heads/next
> 8300349bc1f0a0e2623d5824266bd72c1f4b5f24 refs/notes/commits
> ...

Let me offer a more special-case (but not crazy) example from
git.kernel.org. The newer version of grokmirror that I'm working on is
built to take advantage of the pack-islands feature that was added a
while back. We fetch all linux forks into a single "object storage"
repo, with each fork going into its own
refs/virtual/[uniquename]/(heads|tags) place. This means there are lots
of duplicates in packed-refs, as all the tags from torvalds/linux.git
will end up duplicated in almost every fork.

So, after running git pack-refs --all, the packed-refs file is 50-ish MB
in size, with a lot of the same stuff, like:

5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c refs/virtual/00018460b026/tags/v2.6.11
^c39ae07f393806ccf406ef966e9a15afc43cc36a
...
5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c refs/virtual/00bcef8138af/tags/v2.6.11
^c39ae07f393806ccf406ef966e9a15afc43cc36a

etc., duplicated 600 times, once for each fork. It compresses decently
well with gzip -9, and *amazingly* well with xz -9:

$ ls -ahl packed-refs
-rw-r--r--. 1 mirror mirror 46M Jul 16 22:37 packed-refs
$ ls -ahl packed-refs.gz
-rw-r--r--. 1 mirror mirror 19M Jul 16 22:47 packed-refs.gz
$ ls -ahl packed-refs.xz
-rw-r--r--. 1 mirror mirror 2.3M Jul 16 22:47 packed-refs.xz

Which really just indicates how much duplicated data is in that file.

If reftables are eventually going to replace refs entirely, then we
probably shouldn't expend too much effort super-optimizing packed-refs,
especially if I'm one of the very few people who would benefit from it.
However, I'm curious if a different sorting strategy would help remove
most of the duplication without requiring too much engineering time.

-K
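
P.S. To make the "different sorting strategy" idea a bit more concrete,
here is a rough, purely illustrative sketch. It groups packed-refs
entries by the object they point to, so each (oid, peeled-oid) pair is
written once followed by the refs that share it, and compares the size
of that layout against the flat one. The grouped layout is hypothetical
and not anything git reads or writes today; it only estimates how much
of the file is repeated oid data.

#!/usr/bin/env python3
# Illustrative sketch only, not a format git understands: re-group a
# packed-refs file by the object each ref points to and compare the
# size of that layout against the existing flat layout.
import sys

entries = []  # (oid, refname, peeled-oid-or-None), in file order
with open(sys.argv[1]) as fh:
    for line in fh:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        if line.startswith("^"):
            # peeled object of the preceding (annotated tag) entry
            oid, ref, _ = entries[-1]
            entries[-1] = (oid, ref, line[1:])
        else:
            oid, ref = line.split(" ", 1)
            entries.append((oid, ref, None))

# Group refs by the object(s) they resolve to.
groups = {}  # (oid, peeled) -> [refname, ...]
for oid, ref, peeled in entries:
    groups.setdefault((oid, peeled), []).append(ref)

# Size of the existing flat layout (one full oid per ref line).
flat = []
for oid, ref, peeled in entries:
    flat.append(f"{oid} {ref}")
    if peeled:
        flat.append(f"^{peeled}")
flat_size = sum(len(l) + 1 for l in flat)

# Size of a hypothetical grouped layout (each oid written once).
grouped = []
for (oid, peeled), refs in groups.items():
    grouped.append(oid if peeled is None else f"{oid} ^{peeled}")
    grouped.extend(f"\t{r}" for r in refs)
grouped_size = sum(len(l) + 1 for l in grouped)

print(f"{len(entries)} refs pointing at {len(groups)} distinct objects")
print(f"flat layout:    ~{flat_size} bytes")
print(f"grouped layout: ~{grouped_size} bytes")

Run as "./group-refs.py packed-refs" (the script name is just for
illustration). In a setup like the one above, where the same tag
objects repeat across hundreds of forks, the distinct-object count
should be a small fraction of the total ref count, which is roughly
what the xz numbers are already hinting at.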