What makes a Git repository unwieldy to work with and host? It turns out that the repository's on-disk size in gigabytes is only part of the story. From our experience at GitHub, repositories cause problems because of poor internal layout at least as often as because of their overall size. For example,

* blobs or trees that are too large
* large blobs that are modified frequently (e.g., database dumps)
* large trees that are modified frequently
* trees that expand to unreasonable size when checked out (e.g., "Git bombs" [2])
* too many tiny Git objects
* too many references
* other oddities, such as giant octopus merges, super long reference names or file paths, huge commit messages, etc.

`git-sizer` [1] is a new open-source tool that computes various size-related statistics for a Git repository and points out those that are likely to cause problems or inconvenience to its users.

I tried to make the output of `git-sizer` "opinionated" and easy to interpret. Example output for the Linux kernel is appended below. I also made it memory-efficient and resistant to Git bombs.

I've written a blog post [3] about `git-sizer` with more explanation and examples, and the main project page [1] has a long README with some information about what the individual metrics mean and tips for fixing problems.

I also put quite a bit of effort into making `git-sizer` fast. It does its work (including figuring out path names for large objects) based on a single traversal of the repository history using `git rev-list --objects --reverse [...]`, followed by using the output of `git cat-file --batch` or `git cat-file --batch-check` to get information about individual objects.

On that subject, let me share some more technical details. `git-sizer` is written in Go. I prototyped several ways of extracting object information, which is critical to the performance because `git-sizer` has to read all of the reachable non-blob objects in the repository.
The results surprised me:

| Mechanism for accessing Git data                    | Time   |
| --------------------------------------------------- | -----: |
| `libgit2/git2go`                                    | 25.5 s |
| `libgit2/git2go` with `ManagedTree` optimization    | 18.9 s |
| `src-d/go-git`                                      | 63.0 s |
| Git command line client                             |  6.6 s |

It was almost a factor of four faster to read and parse the output of Git plumbing commands (mainly `git for-each-ref`, `git rev-list --objects`, `git cat-file --batch-check`, and `git cat-file --batch`) than it was to use the Go bindings to libgit2. (I expect that part of the reason is that Go's peculiar stack layout makes it quite expensive to call out to C.) Even after Carlos Martin implemented an experimental `ManagedTree` optimization that removed the need to call C for every entry in a tree, it was still not competitive with the Git CLI. `go-git`, which is a Git implementation in pure Go, was even slower. So the final version of `git-sizer` calls `git` for accessing the repository.

Feedback is welcome, including about the weightings [4] that I use to compute the "level of concern" of the various metrics.

Have fun,
Michael

[1] https://github.com/github/git-sizer
[2] https://kate.io/blog/git-bomb/
[3] https://blog.github.com/2018-03-05-measuring-the-many-sizes-of-a-git-repository/
[4] https://github.com/github/git-sizer/blob/2e9a30f241ac357f2af01d42f0dd51fbbbae4b0b/sizes/output.go#L330-L401

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539

| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    | 723 k     | *                              |
|   * Total size               | 525 MiB   | **                             |
| * Trees                      |           |                                |
|   * Count                    | 3.40 M    | **                             |
|   * Total size               | 9.00 GiB  | ****                           |
|   * Total tree entries       | 264 M     | *****                          |
| * Blobs                      |           |                                |
|   * Count                    | 1.65 M    | *                              |
|   * Total size               | 55.8 GiB  | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    | 534       |                                |
| * References                 |           |                                |
|   * Count                    | 539       |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size [1]         | 72.7 KiB  | *                              |
|   * Maximum parents [2]      | 66        | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries [3]      | 1.68 k    |                                |
| * Blobs                      |           |                                |
|   * Maximum size [4]         | 13.5 MiB  | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      | 136 k     |                                |
| * Maximum tag depth [5]      | 1         | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories [6]  | 4.38 k    | **                             |
| * Maximum path depth [7]     | 13        | *                              |
| * Maximum path length [8]    | 134 B     | *                              |
| * Number of files [9]        | 62.3 k    | *                              |
| * Total size of files [9]    | 747 MiB   |                                |
| * Number of symlinks [10]    | 40        |                                |
| * Number of submodules       | 0         |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})