On Tue, Jul 27, 2021 at 3:37 AM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
>
> On Mon, Jul 26, 2021 at 5:38 PM Christian Couder
> <christian.couder@xxxxxxxxx> wrote:
> >
> > On Sun, Jul 25, 2021 at 2:04 PM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
> > > On Sun, Jul 25, 2021 at 5:23 AM Ævar Arnfjörð Bjarmason
> > > <avarab@xxxxxxxxx> wrote:
> > > >
> > > > Having skimmed it I'm a bit confused about this in reference to
> > > > performance generally. I haven't looked into the case you're
> > > > discussing, but as I noted in
> > > > https://lore.kernel.org/git/87im1p6x34.fsf@xxxxxxxxxxxxxxxxxxx/ the
> > > > profiling clearly shows that the main problem is that you've added
> > > > object lookups we skipped before.
> > >
> > > Yeah, you showed me last time that lookup_object() took up a lot of
> > > time.
> >
> > Could the document explain in some detail why there are more calls
> > to lookup_object()?

Please note that what we are after here is the number of times the
lookup_object() function is called. So to measure it properly, it
might be better to count the calls directly rather than measure the
time spent in the function.

For example, you could add a trace_printf(...) call to
lookup_object(), set GIT_TRACE=/tmp/git_trace.log, run `git cat-file
--batch ...`, and then count how many times the new trace message
from lookup_object() appears in the log file (see the sketch at the
end of this email).

> > For example it could take an example `git cat-file --batch ...`
> > command (if possible a simple one), and say which functions like
> > lookup_object() it was using (and how many times) to get the data
> > it needs before using the ref-filter logic, and then the same
> > information after using the ref-filter logic.
>
> Sorry, but this time I used gprof and couldn't observe the same effect
> as before. lookup_object() is indeed part of the time overhead, but
> its proportion is not very large this time.

I am not sure gprof is a good tool for this. It seems to attribute
the time spent to functions by splitting it among many low-level
functions, which doesn't look like the right approach to me. For
example, if lookup_object() is called 5% more often, the excess time
could be attributed to some low-level functions rather than to
lookup_object() itself. That's why we might get a more accurate view
of what happens by simply counting the number of times the function
is called.

> > It could be nice if there were also some data about how much time
> > used to be spent in lookup_object() and how much time is now spent
> > there, and how this compares with the whole slowdown we are seeing.
> > If Ævar already showed that, you can of course reuse what he
> > already did.

Now I regret having written the above, sorry, as it might not be the
best way to look at this.

> This is my test for git cat-file --batch --batch-all-objects >/dev/null:

[...]

> Because we called parse_object_buffer() in get_object(), lookup_object()
> is called indirectly...

It would be nice if you could add a bit more detail about how
lookup_object() is called (both before and after the changes that
degrade performance).

> We can see that some functions are called the same times:

When you say "the same times", I guess you mean that the same amount
of time is spent in these functions.

> patch_delta(), unpack_entry(), hashmap_remove()... But after using my
> patch, format_ref_array_item(), grab_sub_body_contents(), get_object()
> and lookup_object() begin to occupy a certain proportion.

Thanks!
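
P.S. To make the trace_printf() suggestion above more concrete, here
is a minimal, untested sketch of the kind of throwaway change I have
in mind. The existing body of lookup_object() in object.c is elided
with "...", and the exact trace message is just an illustration:

	struct object *lookup_object(struct repository *r,
				     const struct object_id *oid)
	{
		/*
		 * Throwaway debugging aid: emit one trace message per
		 * call, so the calls can be counted from the log file.
		 */
		trace_printf("lookup_object: %s\n", oid_to_hex(oid));
		...
	}

With that in place, something like:

	$ rm -f /tmp/git_trace.log
	$ GIT_TRACE=/tmp/git_trace.log \
	  git cat-file --batch --batch-all-objects >/dev/null
	$ grep -c 'lookup_object:' /tmp/git_trace.log

should print the number of calls (GIT_TRACE appends to the log file,
hence the rm between runs). Running the same thing with and without
the ref-filter changes would then give us the two call counts we want
to compare.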