`git index-pack --strict` is very slow during pushes to large repos

Craig de Stigter <craig.destigter@xxxxxxxxxxxxxxx> · Mon, 10 May 2021 08:52:19 +1200

Hey folks

(apologies if repost; my first post seemed to disappear entirely)

We're hosting a service with some fairly large repos (created by
Kart[1]), and I've been looking into some poor performance of
`git push` on our service.

Background: We host repositories with a specific layout. I'll try and avoid
most of the technical details but a brief description of the repo layout
might be helpful:

- At each revision we have 256 trees
  - each containing 256 trees (so 65536 trees at this level)
  - each subtree contains a number of objects (distributed via a hash
    scheme, evenly across the subtrees)
- Some repos have up to 100 million blobs active in a given revision.
  In that case each of the 65536 subtrees would contain ~1500 blobs.
- Blobs are usually a few bytes to a few KB in size.
- For various reasons we have disabled deltas entirely.
- Most repos have a few hundred commits, and a typical commit might
  modify 100,000 features (again spread evenly across the 65536 trees),
  thus modifying most of the trees also.
- Our largest repos are currently a few hundred GB on disk.
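
To put rough numbers on the layout above, here is a back-of-envelope sketch in Python. It assumes each modified feature lands in a uniformly random leaf tree, independently of the others, which is a simplification of the hash scheme; the constant names are illustrative only.

```python
TOP_LEVEL_TREES = 256
LEAF_TREES = TOP_LEVEL_TREES * 256   # 65536 second-level trees
BLOBS_PER_REVISION = 100_000_000     # the largest repos
FEATURES_PER_COMMIT = 100_000        # a typical large commit

# Average number of blobs held by each leaf tree.
blobs_per_leaf = BLOBS_PER_REVISION / LEAF_TREES

# Probability that a given leaf tree is untouched by one commit,
# ASSUMING uniform, independent placement of modified features.
p_untouched = (1 - 1 / LEAF_TREES) ** FEATURES_PER_COMMIT
touched_leaf_trees = LEAF_TREES * (1 - p_untouched)

print(f"blobs per leaf tree: {blobs_per_leaf:.0f}")
print(f"leaf trees touched per commit: {touched_leaf_trees:.0f} "
      f"({100 * (1 - p_untouched):.0f}% of {LEAF_TREES})")
```

Under that assumption a single 100,000-feature commit rewrites a bit over three quarters of the leaf trees, which is why a push of ~100K blobs also carries most of the trees.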

We've come across a curious performance issue with `git index-pack` when
invoked by `receive-pack` during a push operation. We have
`transfer.fsckObjects=true` in the server config, so the index-pack
invocation looks like:

```
git --shallow-file shallow_filename index-pack \
   --stdin --keep='receive-pack 1234 on <servername>' \
   --show-resolving-progress --report-end-of-input --fix-thin \
   --strict
```

For our largest repos, when pushing ~100K blobs and associated trees, this
takes a *long* time - sometimes over 12 hours. The process uses enormous
amounts of disk IO (all reads; I haven't measured how much per process, but
the server was doing many terabytes of IO in total).

Here is one that "only" took 45 minutes with a few tracing environment vars
enabled:

```
$ cat craig.pack | /opt/sno/libexec/git-core/git --shallow-file myfilename \
    index-pack --stdin --keep='receive-pack 159567 on servername' \
    --show-resolving-progress --report-end-of-input --fix-thin \
    --strict
07:48:20.781099 common-main.c:48                  version 2.29.2
07:48:20.781111 common-main.c:48             | d0 | main       | version      |     |           |           |              | 2.29.2
07:48:20.781127 common-main.c:49                  start /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781133 common-main.c:49             | d0 | main       | start        |     |  0.000264 |           |              | /opt/sno/libexec/git-core/git --shallow-file indexed.pack index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781296 git.c:444               trace: built-in: git index-pack --stdin '--keep=receive-pack 159567 on cave-7dc7798cc9-qcvxd' --show-resolving-progress --report-end-of-input --fix-thin --strict
07:48:20.781306 git.c:445                         cmd_name index-pack (index-pack)
07:48:20.781312 git.c:445                    | d0 | main       | cmd_name     |     |           |           |              | index-pack (index-pack)
07:48:20.781530 midx.c:184                   | d0 | main       | data         | r0  |  0.000670 |  0.000670 | midx         | load/num_packs:1
07:48:20.781542 midx.c:185                   | d0 | main       | data         | r0  |  0.000683 |  0.000683 | midx         | load/num_objects:42658742
pack    5aa14bbb43187b7dfd5f996514854c3dcdc66d71
08:27:33.724306 git.c:700                         exit elapsed:2352.943441 code:0
08:27:33.724321 git.c:700                    | d0 | main       | exit         |     | 2352.943441 |           |              | code:0
08:27:33.724336 trace2/tr2_tgt_normal.c:123       atexit elapsed:2352.943475 code:0
08:27:33.724341 trace2/tr2_tgt_perf.c:213    | d0 | main       | atexit       |     | 2352.943475 |           |              | code:0
```

Removing the `--strict` from the invocation by disabling
`transfer.fsckObjects` solves the problem - the process completes in less
than a minute, and uses less than a GB of read IO.

I can theorise why this operation is slightly expensive:

   - `--strict` causes `index-pack` to call `fsck_object()` on each object
   pushed
   - these large pushes that push 100K+ blobs actually touch almost every
   *tree* as well - so most/all of the 65K trees are pushed too.
   - calling `fsck_object` on a tree looks up all its children (blobs and
   trees) to ensure they're reachable [2]
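
Taken together, the three points above imply a large number of child lookups per push. The Python sketch below only multiplies figures quoted earlier in this post. It assumes a single pushed commit and exactly one lookup per tree entry, which is a reading of [2] and not something measured; `child_lookups` is a made-up helper name.

```python
TOP_LEVEL_TREES = 256
BLOBS_PER_LEAF_TREE = 1500   # approximate figure from the layout description

def child_lookups(leaf_trees_pushed: int) -> int:
    """Estimate tree entries examined if each pushed tree has every
    child looked up once (an assumption, not measured behaviour)."""
    root_entries = TOP_LEVEL_TREES                     # root tree -> 256 trees
    # All 256 top-level trees are assumed to be pushed, 256 entries each.
    second_level_entries = TOP_LEVEL_TREES * 256
    leaf_entries = leaf_trees_pushed * BLOBS_PER_LEAF_TREE
    return root_entries + second_level_entries + leaf_entries

# Worst case: every one of the 65536 leaf trees is part of the push.
print(child_lookups(65536))
```

That is on the order of 100 million lookups for a push that only adds ~100K new blobs, so almost all of them refer to objects already in the repository.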

What I can't understand is why that makes it take quite *so* much longer
and use so much IO. I think it *should* probably not be checking much about
objects that are already in the repo, other than that they exist. We have
multi-pack indexes enabled, so my assumption is that a "does object xyz
exist?" check should be very inexpensive. What could I be missing here?

As a start of a possible theory, we found when using libgit2 that our
peculiar repo structure with so many trees requires that we expand the size
of the tree cache[3] - otherwise repeated operations on blobs would cause
tree cache misses every time their path was traversed. I wonder if there is
a similar tree cache structure in git itself, and if so could it be relevant
here?
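
To illustrate the kind of cache behaviour meant here, the following is a toy Python simulation. It is not git's or libgit2's actual cache: the `LRUCache` class, the least-recently-used eviction policy, and the cache sizes are all assumptions chosen for illustration. It accesses 65536 leaf trees uniformly at random, as the hash scheme would, and compares a cache much smaller than the tree count with one that can hold every tree.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache keyed by tree id (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def access(self, key):
        if key in self.entries:
            # Hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
        else:
            # Miss: load the entry, evicting the oldest one if over capacity.
            self.misses += 1
            self.entries[key] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)

def miss_rate(num_trees, capacity, accesses, seed=0):
    """Fraction of uniformly random tree accesses that miss the cache."""
    rng = random.Random(seed)
    cache = LRUCache(capacity)
    for _ in range(accesses):
        cache.access(rng.randrange(num_trees))
    return cache.misses / accesses

small = miss_rate(num_trees=65536, capacity=1024, accesses=200_000)
large = miss_rate(num_trees=65536, capacity=65536, accesses=200_000)
print(f"miss rate, 1024-entry cache:  {small:.2f}")
print(f"miss rate, 65536-entry cache: {large:.2f}")
```

With evenly spread accesses, the small cache misses on nearly every lookup, while the large one only misses the first time each tree is seen. Whether git has a comparable structure on this code path is exactly the open question.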

Many thanks and sorry about the long winded post :)

Craig de Stigter
Platform Engineer
Koordinates

references:
[1]: https://kartproject.org
[2]: fsck_walk_tree:
https://github.com/git/git/blob/a0dda6023ed82b927fa205c474654699a5b07a82/fsck.c#L300
[3]: GIT_OPT_SET_CACHE_OBJECT_LIMIT:
https://github.com/libgit2/libgit2/blob/508361401fbb5d87118045eaeae3356a729131aa/include/git2/common.h#L266-L272
