On Fri, Mar 16 2018, Michael Haggerty jotted:

> What makes a Git repository unwieldy to work with and host? It turns
> out that the repository's on-disk size in gigabytes is only part of
> the story. From our experience at GitHub, repositories cause problems
> because of poor internal layout at least as often as because of their
> overall size. For example,
>
> * blobs or trees that are too large
> * large blobs that are modified frequently (e.g., database dumps)
> * large trees that are modified frequently
> * trees that expand to unreasonable size when checked out (e.g., "Git
>   bombs" [2])
> * too many tiny Git objects
> * too many references
> * other oddities, such as giant octopus merges, super long reference
>   names or file paths, huge commit messages, etc.
>
> `git-sizer` [1] is a new open-source tool that computes various
> size-related statistics for a Git repository and points out those that
> are likely to cause problems or inconvenience to its users.

This is a very useful tool. I've been using it to get insight into some
bad repositories.

A suggestion for something to add to it, since I don't have the time or
the Go tuits myself:

One thing that can make repositories very pathological is if the ratio
of trees to commits is too low. I was dealing with a repo the other day
that had several thousand files all in the same root directory, and no
subdirectories. This meant that doing `git log -- <file>` was very
expensive.

I wrote a bit about this on this related ticket the other day:
https://gitlab.com/gitlab-org/gitlab-ce/issues/42104#note_54933512

But it's not something where you can just say that having more trees is
better, because at the other end of the spectrum we can imagine a repo
like linux.git where each file, such as COPYING, instead lives at
C/O/P/Y/I/N/G. That would also be pathological.

It would be very interesting to do some tests to see what the optimal
value would be.

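As a rough illustration of the two extremes described above (everything in the root versus one directory per path component), here is a minimal Python sketch that summarizes how files are spread across trees. The function name `tree_stats` and the choice of statistics are assumptions made for this example, not anything git-sizer actually reports.

```python
from collections import Counter
from pathlib import PurePosixPath


def tree_stats(paths):
    """Summarize how files are spread over directories (trees).

    `paths` is a list of repository-relative file paths that use '/'
    as the separator.
    """
    files_per_tree = Counter()
    trees = {"."}  # the root tree always exists
    for path in paths:
        parent = PurePosixPath(path).parent
        files_per_tree[str(parent)] += 1
        # Register every ancestor directory of the file as a tree.
        while str(parent) != ".":
            trees.add(str(parent))
            parent = parent.parent
    n = len(paths)
    return {
        "files": n,
        "trees": len(trees),
        "files_in_root": files_per_tree["."],
        "max_files_in_one_tree": max(files_per_tree.values(), default=0),
        "mean_depth": sum(p.count("/") for p in paths) / n if n else 0.0,
    }


# Hypothetical flat layout: thousands of files, a single tree.
flat = tree_stats([f"file{i}.c" for i in range(5000)])

# Hypothetical over-nested layout: one tree per path component.
deep = tree_stats(["C/O/P/Y/I/N/G"])
```

The path list could come from something like `git ls-tree -r --name-only HEAD` (adding `-z` if file names may contain unusual characters). A large `files_in_root` or `max_files_in_one_tree` points at the flat case, and a `mean_depth` that is high relative to the number of files points at the over-nested one. Where the healthy middle lies is exactly the open question here.
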
I also suspect it's not really about the commit / tree ratio, but that
you need some reasonable number of nested trees per file, *and* that
changes to them need to be reasonably spread out. I.e. it doesn't help
to have a doc/ and a src/ directory if 99% of your commits change src/
and you're running 'git log -- src/something.c'.

All of which is a very long-winded way of saying that I don't know what
the general rule is. I have some suspicions, but the one thing I'm sure
of is that having all your files in the root is definitely bad.
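
The "reasonably spread out" condition can be checked in a similarly rough way. The sketch below assumes you already have one path per changed file per commit, for example from something like `git log --name-only --format=`. The function name `change_share_by_top_level` and the "(root)" label are made up for this illustration.

```python
from collections import Counter


def change_share_by_top_level(touched_paths):
    """Return {top-level directory: fraction of all file touches}.

    `touched_paths` holds one repository-relative path for every file
    changed by every commit, so a file changed in ten commits appears
    ten times. Files that live directly in the root are grouped under
    the label "(root)".
    """
    counts = Counter()
    for path in touched_paths:
        if not path:
            continue  # skip blank separator lines from the log output
        if "/" in path:
            top = path.split("/", 1)[0] + "/"
        else:
            top = "(root)"
        counts[top] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {top: n / total for top, n in counts.items()}


# Hypothetical history in which nearly every change lands in src/.
share = change_share_by_top_level(["src/something.c"] * 99 + ["doc/README"])
```

If one directory takes almost all of the changes, as `src/` does in this made-up input, then splitting the repository into `doc/` and `src/` does little for a path-limited log inside `src/`, because that tree still changes in nearly every commit.
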