Re: [PATCH 6/6] hash-object: use fsck for object checks

Jeff King <peff@xxxxxxxx> · Wed, 18 Jan 2023 21:31:36 -0500

On Wed, Jan 18, 2023 at 04:34:02PM -0500, Taylor Blau wrote:

> That being said, let me play devil's advocate for a second. Do the new
> fsck checks slow anything in hash-object down significantly? If so, then
> it's plausible to imagine a hash-object caller who (a) doesn't use
> `--literally`, but (b) does care about throughput if they're writing a
> large number of objects at once.
> 
> I don't know if such a situation exists, or if these new fsck checks
> even slow hash-object down enough to care. But I didn't catch a
> discussion of this case in your series, so I figured I'd bring it up
> here just in case.

That's a really good point to bring up.

Prior to timing anything, here were my guesses:

  - it won't make a big difference either way because the time is
    dominated by computing sha1 anyway

  - we might actually be a little faster for commits and tags in the new
    code, because they aren't allocating structs for the pointed-to
    objects (trees, parents, etc). Nor stuffing them into obj_hash, so
    our total memory usage would be lower.

  - trees may be a little slower, because we're doing a more analysis on
    the filenames (sort order, various filesystem specific checks for
    .git, etc)

And here's what I timed, using linux.git. First I pulled out the raw
object data like so:

  mkdir -p commit tag tree

  git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)' |
  perl -alne 'print $F[1] unless $F[0] eq "blob"' |
  git cat-file --batch |
  perl -ne '
    /(\S+) (\S+) (\d+)/ or die "confusing: $_";
    my $dir = "$2/" . substr($1, 0, 2);
    my $fn = "$dir/" . substr($1, 2);
    mkdir($dir);
    open(my $fh, ">", $fn) or die "open($fn): $!";
    read(STDIN, my $buf, $3) or die "read($3): $!";
    print $fh $buf;
    read(STDIN, $buf, 1); # trailing newline
  '

And then I timed it like this:

  find commit -type f | sort >input
  hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t commit <input'

which yielded:

  Benchmark 1: ./git.old hash-object --stdin-paths -t commit <input
    Time (mean ± σ):      7.264 s ±  0.142 s    [User: 4.129 s, System: 3.043 s]
    Range (min … max):    7.098 s …  7.558 s    10 runs

  Benchmark 2: ./git.new hash-object --stdin-paths -t commit <input
    Time (mean ± σ):      6.832 s ±  0.087 s    [User: 3.848 s, System: 2.901 s]
    Range (min … max):    6.752 s …  7.059 s    10 runs

  Summary
    './git.new hash-object --stdin-paths -t commit <input' ran
      1.06 ± 0.02 times faster than './git.old hash-object --stdin-paths -t commit <input'

So the new code is indeed faster, though really most of the time is
spent reading the data and computing the hash anyway. For comparison,
using --literally drops it to ~6.3s.

And according to massif, peak heap drops from 241MB to 80k. Which is
pretty good.

Trees are definitely slower, though. I reduced the number to fit in my
budget of patience:

  find tree -type f | sort | head -n 200000 >input
  hyperfine -L v old,new './git.{v} hash-object --stdin-paths -t tree <input'

And got:

  Benchmark 1: ./git.old hash-object --stdin-paths -t tree <input
    Time (mean ± σ):      2.470 s ±  0.022 s    [User: 1.902 s, System: 0.549 s]
    Range (min … max):    2.442 s …  2.509 s    10 runs

  Benchmark 2: ./git.new hash-object --stdin-paths -t tree <input
    Time (mean ± σ):      3.244 s ±  0.026 s    [User: 2.661 s, System: 0.567 s]
    Range (min … max):    3.215 s …  3.295 s    10 runs

  Summary
    './git.old hash-object --stdin-paths -t tree <input' ran
      1.31 ± 0.02 times faster than './git.new hash-object --stdin-paths -t tree <input'

So we indeed got a bit slower (and --literally here is ~2.2s). It's
enough that it outweighs the benefits from the commits getting faster
(especially because there tend to be more trees than commits). But those
also get diluted by blobs (which have a lot of data to hash and free
fsck checks).

So in the end, I think nobody would really care that much. The absolute
numbers are pretty small, and this is already a fairly dumb way to get
objects into your repository. The usual way is via index-pack, and it
already uses the fsck code for its checks. But I do think it was a good
question to explore (plus it found a descriptor leak in hash-object,
which I sent a separate patch for).

-Peff