Re: [GSOC] How to improve the performance of git cat-file --batch

Christian Couder <christian.couder@xxxxxxxxx> · Wed, 28 Jul 2021 17:36:33 +0200

On Wed, Jul 28, 2021 at 3:38 PM ZheNing Hu <adlternative@xxxxxxxxx> wrote:

> Ok, therefore we need an accurate number of call times about lookup_object(),
> although the conclusion is obvious: 0 (upstream/master) and a big
> number (with my patch).

[...]

> This is the only 1 time left is printed by git.c, which show that after using
> my patch, we additionally call  lookup_object() when we use --batch option.
> According to the results of the previous gprof test: lookup_object()
> occupies 8.72%
> of the total time. (Though below you seem to think that the effect of
> gprof is not
> reliable enough.) This may be a place worthy of optimization.

First, yeah, I hadn't seen the "calls" columns in your gprof reports.
Sorry! It's nice to see that your manual check with a trace_printf()
function gives the same result as gprof about this though.

Anyway if you agree that it might be a place worthy of optimization,
then it might be a good idea to explain the reason for the numerous
lookup_object() calls when using the ref-filter code.

[...]

> > It would be nice if you could add a bit more details about how
> > lookup_object() is called (both before and after the changes that
> > degrade performance).
>
> After we letting git cat-file --batch reuse the logic of ref-filter,
> we will use get_object()
> to grab the object's data. Since we used atom %(raw), it will require
> us to grab the raw data
> of the object, oi->info.contentp will be set, parse_object_buffer() in
> get_object() will be
> called, parse_object_buffer() calls lookup_commit(), lookup_blob(),
> lookup_tree(),
> and lookup_tag(), they call lookup_object(). As we have seen,
> lookup_object() seems to
> take a lot of time.

Not sure why you are talking about %(raw). Is the root of the issue
that we now use atom %(raw), or how we implemented it?

> So let us think, can we skip this parse_object_buffer() in some scenarios?
> parse_object_buffer() parses the data of the object into a "struct
> object *obj", and then we use this
> obj feed to grab_values(), and then grab_values() feed obj to
> grab_tag_values() or grab_commit_values()
> to handle some logic about %(tag), %(type), %(object), %(tree),
> %(parent), %(numparent).
>
> But git cat-file --batch can avaid handle there atoms with default format.
>
> Therefore, maybe we can skip parsing object buffer if we really don't
> care about these atoms.

Yeah, maybe oi->info.contentp should be set only if the user specified
one of the atoms that really needs the content provided by
parse_object_buffer().

Thanks!