Re: [RFC PATCH 0/3] replay: implement support for writing new objects to a pack

On Tue, Nov 7, 2023 at 10:22 AM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> (Based on a combination of Christian's cc/git-replay and my
> tb/merge-tree-write-pack branches).
>
> This RFC demonstrates extending the new `--write-pack` option that
> `merge-tree` recently learned to the `replay` builtin as well.
>
> The approach is as follows:
>
>   - write a pack out after each step in the replay operation, so that
>     subsequent steps may see any new object(s) created during previous
>     steps
>
>   - combine those packs into one before migrating them back into the
>     main object store
>
> This is accomplished with a combination of the bulk-checkin and
> tmp-objdir APIs, with some minor modifications made to when we flush out
> and finalize bulk-checkin transactions.
>
> The benefit to this approach is that we bound the number of inodes
> required per replayed commit to a constant (by default, 3: one for the
> .pack, one for the .idx, and another for the .rev file), instead of
> having each operation take an unbounded number of inodes proportional to
> the number of new objects created during that step. We also only migrate
> a single pack back to the main object store.

Isn't it actually 4?  Since you only put blobs and trees into the
bulk-checkin packfiles, the commit object will still be loose.

> In other words, this makes the maximum number of inodes required by
> 'replay' grow proportional to the number of commits being replayed,
> instead of the number of new *objects* created as a result of the replay
> operation.

As per comments on the other series, we actually need 2 packs per
commit (when replaying non-merge commits, we don't have to worry about
recursive merges, so 2*depth is just 2), so this would be 7 inodes per
commit (3 files per pack * 2 packs + 1 loose commit object).
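
To spell out that arithmetic (just a back-of-the-envelope sketch in
Python, assuming the 3-files-per-pack and 2-packs-per-commit figures
above):

  # Per-commit inode count with --write-pack, under the assumptions above:
  # each pack costs a .pack, .idx, and .rev file, we need 2 packs per
  # replayed non-merge commit, and the commit object itself stays loose.
  files_per_pack = 3
  packs_per_commit = 2
  loose_commit_object = 1
  print(files_per_pack * packs_per_commit + loose_commit_object)  # => 7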

I was curious what the comparative number of loose objects might be if
we didn't use the bulk-checkin mechanism, so:
$ cd linux.git
$ git rev-list --no-merges --count HEAD
1141436
$ time git log --oneline --no-merges --name-only | wc -l
3781628
$ python -c 'print(3781628/1141436)'
3.3130442705504293

So, if repositories are similar to linux, that's a bit over 3 files
modified per commit.  It's a bit harder to count trees, but let's take
a wild guess and say that each file is 7 directories deep (because I'm
too lazy to do real research and 7 happens to give me nice round
numbers later), so that's up to 7*3 tree objects.  So 3 file objects +
21 tree objects + 1 commit object = 25 loose objects.
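
Written out as a quick Python estimate (same wild guesses: ~3 files
modified per commit, each 7 directories deep, and ignoring any sharing
of parent directories between the modified files):

  # Rough loose objects per replayed commit without bulk-checkin packs.
  blobs_per_commit = 3   # rounded from the ~3.3 measured above
  trees_per_blob = 7     # wild guess at directory depth, no dedup across files
  commit_object = 1
  print(blobs_per_commit * trees_per_blob
        + blobs_per_commit + commit_object)  # => 25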

So, your scheme certainly seems to reduce the number of inodes, but
does it introduce a different scaling issue?  git-gc repacks when there
are >= 50 packs, or when there are >= 6700 loose objects, suggesting
that once we exceed those numbers, performance might start to suffer.
We would hit 50 packs after replaying a mere 25 commits, whereas we
wouldn't expect to reach 6700 loose objects until we had replayed about
268 commits (6700/25).  Does this mean we risk worse performance
degradation with this scheme than with just using loose objects until
the end of the operation?
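
To put rough numbers on where each approach trips those heuristics (a
sketch assuming the defaults gc.autoPackLimit=50 and gc.auto=6700, 2
packs per replayed commit with this scheme, and ~25 loose objects per
commit without it):

  # Commits replayed before crossing git-gc's default thresholds.
  pack_limit = 50        # gc.autoPackLimit default
  loose_limit = 6700     # gc.auto default
  packs_per_commit = 2   # with --write-pack, per the above
  loose_per_commit = 25  # estimated above, without --write-pack
  print(pack_limit // packs_per_commit)   # => 25 commits until 50 packs
  print(loose_limit // loose_per_commit)  # => 268 commits until ~6700 loose objects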




