I've been focused recently on understanding and mitigating the growth of
a few internal repositories. Some of these are growing much larger than
expected for the number of contributors, and there are multiple aspects
to why this growth is so large.

I will be submitting an RFC with my deep dive into the many aspects of
this issue, but the very last thing I discovered in this space is
actually the easiest change to make.

The main issue plaguing these repositories is that deltas are not being
computed against objects that appear at the same path. The size of these
files at tip is one contributor to the growth, but the changes to these
files are reasonable and should result in good delta compression.
However, Git is not discovering the connections across different
versions of the same file.

One way to find some improvement in these repositories is to increase
the window size. That this helped was an early hint that the delta
compression could be improved, but it was not a clear indicator. After
some digging (and prototyping some analysis tools), the main discovery
was that the current name-hash algorithm only considers the last 16
characters in the path name and has some naturally-occurring collisions
within that scope.

This series introduces a new name-hash algorithm, but does not replace
the existing one. There are cases, such as packing a single snapshot of
a repository, where the existing algorithm outperforms the new one.

However, my findings show that when a repository has many versions of
files at the same path (and especially when there are many name-hash
collisions), there are significant gains to be made using the new
algorithm.

  Repo      Standard Repack   With --full-name-hash
  --------  ---------------   ---------------------
  fluentui           438 MB                  168 MB
  Repo B           6,255 MB                  829 MB
  Repo C          37,737 MB                7,125 MB
  Repo D         130,049 MB                6,190 MB

The main change in this series is in patch 1, which adds the algorithm
and the option to 'git pack-objects' and 'git repack'.
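
As a rough illustration of the 16-character limitation mentioned above,
here is a Python sketch modeled on the existing shift-based hash. The
function name and the two paths are hypothetical, the 32-bit unsigned
arithmetic of the C code is emulated with a mask, and the shift
constants are my reading of the existing implementation rather than
something stated in this letter.

```python
def shift_name_hash(path: str) -> int:
    """Sketch of a shift-based 32-bit name hash (assumed constants).

    Each character is added into the top byte, and every later
    character pushes it two bits toward the bottom, so a character
    is shifted out completely after sixteen steps.
    """
    h = 0
    for ch in path:
        if ch.isspace():
            continue  # whitespace does not contribute to the hash
        h = ((h >> 2) + (ord(ch) << 24)) & 0xFFFFFFFF
    return h


# Hypothetical paths: different directories, same last 16 characters.
path_a = "apps/web/components/CHANGELOG.json"
path_b = "apps/docs/components/CHANGELOG.json"

hash_a = shift_name_hash(path_a)
hash_b = shift_name_hash(path_b)

print(path_a[-16:], path_b[-16:])
print(f"{hash_a:08x} {hash_b:08x}")
```

For ASCII paths like these, characters older than sixteen positions can
at most leak a carry into the lowest bit, so the two values are either
identical or adjacent. Unrelated files at different paths therefore end
up grouped together whenever their trailing characters match.
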
The remaining patches are focused on creating more evidence around the
value of the new name-hash algorithm and its effects on the packfiles
created with it.

I will also try to make clear that I've been focused on client-side
performance and size concerns. I do not know if using this option will
have issues with advanced server-side repacking features, such as delta
islands, reachability bitmaps, or serving clones and fetches from the
resulting packfile. My educated guess is that the name-hash value does
not affect these features in any direct way, but I'll leave the testing
of the server scenarios to the experts.

Thanks,
-Stolee

Derrick Stolee (4):
  pack-objects: add --full-name-hash option
  git-repack: update usage to match docs
  p5313: add size comparison test
  p5314: add a size test for name-hash collisions

 Documentation/git-pack-objects.txt |  3 +-
 Documentation/git-repack.txt       |  4 +-
 Makefile                           |  1 +
 builtin/pack-objects.c             | 20 ++++++---
 builtin/repack.c                   |  9 +++-
 pack-objects.h                     | 20 +++++++++
 t/helper/test-name-hash.c          | 23 ++++++++++
 t/helper/test-tool.c               |  1 +
 t/helper/test-tool.h               |  1 +
 t/perf/p5313-pack-objects.sh       | 71 ++++++++++++++++++++++++++++++
 t/perf/p5314-name-hash.sh          | 41 +++++++++++++++++
 t/t0450/txt-help-mismatches        |  1 -
 12 files changed, 186 insertions(+), 9 deletions(-)
 create mode 100644 t/helper/test-name-hash.c
 create mode 100755 t/perf/p5313-pack-objects.sh
 create mode 100755 t/perf/p5314-name-hash.sh


base-commit: 4c42d5ff284067fa32837421408bebfef996bf81
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1785%2Fderrickstolee%2Ffull-name-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1785/derrickstolee/full-name-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1785
-- 
gitgitgadget