Hey folks

(apologies if repost; my first post seemed to disappear entirely)

We're hosting a service with some fairly large repos (created by Kart [1]), and I've been looking into some poor performance of `git push` on our service.

Background:

We host repositories with a specific layout. I'll try and avoid most of the technical details but a brief description of the repo layout might be helpful:

- At each revision we have 256 trees
  - each containing 256 trees (so 65536 trees at this level)
  - each subtree contains a number of objects (distributed via a hash scheme, evenly across the subtrees)
- Some repos have up to 100 million blobs active in a given revision. In that case each of the 65536 subtrees would contain ~1500 blobs.
- Blobs are usually a few bytes to a few KB in size.
- For various reasons we have disabled deltas entirely.
- Most repos have a few hundred commits, and a typical commit might modify 100,000 features (again spread evenly across the 65536 trees), thus modifying most of the trees also.
- Our largest repos are currently a few hundred GB on disk.

We've come across a curious performance issue with `git index-pack` when invoked by `receive-pack` during a push operation. We have `transfer.fsckObjects=true` in the server config, so the index-pack invocation looks like:

```
git --shallow-file shallow_filename index-pack \
    --stdin --keep='receive-pack 1234 on <servername>' \
    --show-resolving-progress --report-end-of-input --fix-thin \
    --strict
```

For our largest repos, when pushing ~100K blobs and associated trees, this takes a *long* time - sometimes over 12 hours.
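
To put rough numbers on the layout described above, here is a minimal Python sketch. It is not part of Kart or git; the function names are hypothetical, and it assumes each modified feature corresponds to one blob that lands uniformly at random in one of the 65536 second-level trees.

```python
# Back-of-envelope model of the repo layout described above.
# Assumption: blobs are spread uniformly at random across the leaf trees.

TOP_LEVEL_TREES = 256
SUBTREES_PER_TREE = 256
LEAF_TREES = TOP_LEVEL_TREES * SUBTREES_PER_TREE  # 65536


def blobs_per_leaf_tree(total_blobs: int) -> float:
    """Average number of blobs held by each second-level tree."""
    return total_blobs / LEAF_TREES


def expected_touched_leaf_trees(modified_blobs: int) -> float:
    """Expected number of leaf trees that receive at least one modified blob.

    A given leaf tree stays untouched with probability
    (1 - 1/LEAF_TREES) ** modified_blobs.
    """
    p_untouched = (1 - 1 / LEAF_TREES) ** modified_blobs
    return LEAF_TREES * (1 - p_untouched)


if __name__ == "__main__":
    per_tree = blobs_per_leaf_tree(100_000_000)
    touched = expected_touched_leaf_trees(100_000)
    print(f"blobs per leaf tree: {per_tree:.0f}")
    print(f"leaf trees touched by a 100K-blob commit: {touched:.0f} "
          f"({touched / LEAF_TREES:.0%} of {LEAF_TREES})")
```

Under those assumptions, a commit that modifies 100,000 blobs is expected to rewrite a little under 80% of the leaf trees, which matches the observation that a typical push carries most of the trees along with the blobs.
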
The process uses enormous amounts of disk IO (all reads; I haven't measured how much per process, but the server was doing many terabytes of IO in total).

Here is one that "only" took 45 minutes with a few tracing environment vars enabled:

```
$ cat craig.pack | /opt/sno/libexec/git-core/git --shallow-file myfilename index-pack --stdin --keep='receive-pack 159567 on servername' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781099 common-main.c:48 version 2.29.2
07:48:20.781111 common-main.c:48 | d0 | main | version | | | | | 2.29.2
07:48:20.781127 common-main.c:49 start /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781133 common-main.c:49 | d0 | main | start | | 0.000264 | | | /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781296 git.c:444 trace: built-in: git index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781306 git.c:445 cmd_name index-pack (index-pack)
07:48:20.781312 git.c:445 | d0 | main | cmd_name | | | | | index-pack (index-pack)
07:48:20.781530 midx.c:184 | d0 | main | data | r0 | 0.000670 | 0.000670 | midx | load/num_packs:1
07:48:20.781542 midx.c:185 | d0 | main | data | r0 | 0.000683 | 0.000683 | midx | load/num_objects:42658742
pack 5aa14bbb43187b7dfd5f996514854c3dcdc66d71
08:27:33.724306 git.c:700 exit elapsed:2352.943441 code:0
08:27:33.724321 git.c:700 | d0 | main | exit | | 2352.943441 | | | code:0
08:27:33.724336 trace2/tr2_tgt_normal.c:123 atexit elapsed:2352.943475 code:0
08:27:33.724341 trace2/tr2_tgt_perf.c:213 | d0 | main | atexit | | 2352.943475 | | | code:0
```

Removing the `--strict` from the invocation by disabling `transfer.fsckObjects` solves the problem - the process completes in less than a minute, and uses less than a GB of read IO.

I can theorise why this operation is slightly expensive:

- `--strict` causes `index-pack` to call `fsck_object()` on each object pushed
- these large pushes that push 100K+ blobs actually touch almost every *tree* as well - so most/all of the 65K trees are pushed too.
- calling `fsck_object` on a tree looks up all its children (blobs and trees) to ensure they're reachable [2]

What I can't understand is why that makes it take quite *so* much longer and use so much IO. I think it *should* probably not be checking much about objects that are already in the repo, other than that they exist. We have multi-pack indexes enabled, so my assumption is that a "does object xyz exist?" check should be very inexpensive.

What could I be missing here?

As a start of a possible theory, we found when using libgit2 that our peculiar repo structure with so many trees requires that we expand the size of the tree cache [3] - otherwise repeated operations on blobs would cause tree cache misses every time their path was traversed. I wonder if there is a similar tree cache structure in git itself, and if so could it be relevant here?

Many thanks and sorry about the long-winded post :)

Craig de Stigter
Platform Engineer
Koordinates

references:
[1]: https://kartproject.org
[2]: fsck_walk_tree: https://github.com/git/git/blob/a0dda6023ed82b927fa205c474654699a5b07a82/fsck.c#L300
[3]: GIT_OPT_SET_CACHE_OBJECT_LIMIT: https://github.com/libgit2/libgit2/blob/508361401fbb5d87118045eaeae3356a729131aa/include/git2/common.h#L266-L272