On Sun, Apr 09, 2023 at 02:47:30PM +0800, ZheNing Hu wrote:

> > Perhaps slightly so, since there is naturally going to be some
> > duplicated effort spawning processes, loading any shared libraries,
> > initializing the repository and reading its configuration, etc.
> >
> > But I'd wager that these are all a negligible cost when compared to
> > the time we'll have to spend reading, inflating, and printing out
> > all of the objects in your repository.
>
> What you said makes sense. I implemented the --type-filter option for
> git cat-file and compared the performance of outputting all blobs in
> the git repository with and without the type filter. I found that the
> difference was not significant.
>
>   time git cat-file --batch-all-objects \
>       --batch-check="%(objectname) %(objecttype)" |
>     awk '{ if ($2 == "blob") print $1 }' |
>     git cat-file --batch > /dev/null
>   17.10s user 0.27s system 102% cpu 16.987 total
>
>   time git cat-file --batch-all-objects --batch --type-filter=blob >/dev/null
>   16.74s user 0.19s system 95% cpu 17.655 total
>
> At first, I thought the processes that provide all the blob oids by
> using git rev-list or git cat-file --batch-all-objects --batch-check
> might waste cpu, io, and memory resources, because they need to read
> a large number of objects which are then read again by git cat-file
> --batch. However, it seems that this is not actually the performance
> bottleneck.

Yeah, I think most of your time there is spent on the --batch command
itself, which is just pushing through a lot of bytes.

You might also try with "--unordered". The default ordering for
--batch-all-objects is sha1 order, which has pretty bad locality
characteristics for delta caching. Using --unordered goes in pack
order, which should be optimal. E.g., in git.git, running:

  time \
    git cat-file --batch-all-objects \
        --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

takes:

  real    0m29.961s
  user    0m29.128s
  sys     0m1.461s

Adding "--unordered" to the initial cat-file gives:

  real    0m1.970s
  user    0m2.170s
  sys     0m0.126s
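That is, the same pipeline with the flag added to the first cat-file:

  time \
    git cat-file --batch-all-objects --unordered \
        --batch-check='%(objecttype) %(objectname)' |
    perl -lne 'print $1 if /^blob (.*)/' |
    git cat-file --batch >/dev/null

The second cat-file then sees its input in pack order as well, which
is presumably where most of the win is, since that is the process
doing the expensive inflation.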
So reducing the time spent on the actual --batch printing may make the
relative cost of using multiple processes much higher (I didn't apply
your --type-filter patches to test myself).

In general, I do think having a processing pipeline like this is OK,
as it's pretty flexible. But especially for smaller queries (even ones
that don't ask for the whole object contents), the per-object lookup
costs can start to dominate (especially in a repository that hasn't
been recently packed). Right now, even your "--batch --type-filter"
example is probably making at least two lookups per object, because we
don't have a way to open a "handle" to an object to check its type and
then extract the contents conditionally. And of course with multiple
processes, we're naturally doing a separate lookup in each one.
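As a sketch of such a small query (I haven't timed this one): asking
only for blob sizes never inflates any contents, so the two rounds of
per-object lookup are most of the work the pipeline does:

  # First cat-file: one lookup per object to learn its type. Second
  # cat-file: another lookup for each oid fed to it, just to report a
  # size. No object contents are printed at any point.
  git cat-file --batch-all-objects --unordered \
      --batch-check='%(objectname) %(objecttype)' |
    awk '$2 == "blob" { print $1 }' |
    git cat-file --batch-check='%(objectsize)' >/dev/null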
So a nice thing about being able to do the filtering in one process is
that we could _eventually_ do it all with one object lookup. But I'd
probably wait on adding something like --type-filter until we have an
internal single-lookup API, and then we could time it to see how much
speedup we can get.

-Peff