Re: [GSoC] Git Blog 12

ZheNing Hu <adlternative@xxxxxxxxx> · Tue, 10 Aug 2021 22:20:21 +0800

Christian Couder <christian.couder@xxxxxxxxx> wrote on Tue, Aug 10, 2021 at 4:04 PM:

> > parse_object_buffer(), let's take a look at the result of gprof again:
> >
> > We need to call grab_sub_body_contents(), grab_person() to rescan the
> > buffer and extract the data.
> > What if we can combine these multiple scanning and parsing into one completion?
> > At least intuitively, this has an opportunity to improve performance.
>
> Yeah, but is there a way to check that we indeed scan or parse the
> same objects multiple times? This way we might get an idea about how
> much scanning and parsing we could save.
>

I think find_subpos() called by grab_sub_body_contents() and find_wholine()
called by grab_person() are evidence that we scan the same buffer repeatedly.
But the proportion of time they take is very small: 0.0142% and 0.0109%.

Sorry, but my attempts over the past two days have not gone well. The changes
here would make the program very complicated, so this optimization is not
worth doing.

> > So I check the implementation
> > details of `parse_commit_buffer()` and `parse_tag_buffer()`, maybe we
> > can pass some "hook pointer"
> > to these parsing functions like `oid_object_info_extended()` does to
> > extract only the information we need?
>
> Would this also avoid scanning and parsing the same object many times?
>

oid_object_info_extended()? I think the caller can set the pointers it cares
about and the function extracts only the required values. Well, the problem
it solves may be a little different from the one here.
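
To make the "set a pointer for what you need" idea concrete, here is a
minimal sketch of how a single pass over a commit header could fill only
the requested fields. Everything named here (struct field_view, struct
commit_request, scan_commit_header()) is hypothetical and is not existing
git code; it only assumes the usual "key SP value LF" header layout that
ends at the first blank line.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: a slice into the object buffer, no copy is made. */
struct field_view {
	const char *start;
	size_t len;
};

/* Hypothetical request: a NULL pointer means "not wanted". */
struct commit_request {
	struct field_view *tree;
	struct field_view *author;
	struct field_view *committer;
};

/* If the line starts with key, store the rest of the line in out. */
static int match_header(const char *line, size_t len, const char *key,
			struct field_view *out)
{
	size_t klen = strlen(key);

	if (!out || len < klen || memcmp(line, key, klen))
		return 0;
	out->start = line + klen;
	out->len = len - klen;
	return 1;
}

/*
 * Walk the header lines once and stop at the blank line that precedes
 * the commit message. Returns the number of requested fields found.
 */
static int scan_commit_header(const char *buf, size_t size,
			      const struct commit_request *req)
{
	const char *p = buf, *end = buf + size;
	int found = 0;

	while (p < end && *p != '\n') {
		const char *eol = memchr(p, '\n', end - p);
		size_t len = eol ? (size_t)(eol - p) : (size_t)(end - p);

		found += match_header(p, len, "tree ", req->tree) ||
			 match_header(p, len, "author ", req->author) ||
			 match_header(p, len, "committer ", req->committer);
		if (!eol)
			break;
		p = eol + 1;
	}
	return found;
}
```

A caller that only wants the author would leave tree and committer as
NULL, so those lines are skipped without recording anything.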

> > The commit-slab caught my attention. It can be used to get some
> > specified data content from the object.
>
> I thought it was for storing commit data in an efficient way.
>

Yeah.

> > I am thinking about whether it is possible to design a `struct
> > object_view` (temporarily called
> > `struct commit_view`) to store the offset of the parsed data in the
> > object content. `parse_commit_buffer()`
> > will check whether we need something for in-depth parsing. Like this:
> >
> > ```c
> > struct commit_view {
> > int need_tree : 1;
> > int need_parents : 1;
> >
> > int need_author : 1;
> > int need_author_name : 1;
> > int need_author_email : 1;
> > int need_author_date : 1;
> >
> > int need_committer : 1;
> > int need_committer_name : 1;
> > int need_committer_email : 1;
> > int need_committer_date : 1;
>
> Is the above info specific for each commit? Or will the above be the
> same for all the commits we are processing?
>

According to my previous thoughts, I think it is the same for all commits.
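
Since the flags would be the same for every commit, they could be computed
once from the format atoms before any object is read. A rough sketch,
assuming a hypothetical struct commit_needs and helper mark_needed()
(neither exists in git), and using unsigned bit-fields so that a set flag
reads back as 1:

```c
#include <string.h>

/* Hypothetical: one instance shared by all commits in a run. */
struct commit_needs {
	unsigned need_tree : 1;
	unsigned need_author : 1;
	unsigned need_committer : 1;
};

/* Record which header lines an atom such as "authorname" depends on. */
static void mark_needed(struct commit_needs *needs, const char *atom)
{
	if (!strcmp(atom, "tree"))
		needs->need_tree = 1;
	else if (!strncmp(atom, "author", 6))
		needs->need_author = 1;
	else if (!strncmp(atom, "committer", 9))
		needs->need_committer = 1;
}
```

The parser would then consult this one struct for every commit instead of
deciding per object.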

>
> Ok, so the idea is to use the commit slab feature to store the struct
> commit_view instances. That seems reasonable to me.
>

It's a pity that it may not bring much optimization in real situations.

> > It seems that GSOC has only the last few weeks left, I'm not sure how
> > far this patch series is from
> > being merged by the master branch. Performance optimization may have
> > no end.
>
> Yeah, but the idea for now is just to make using the ref-filter code
> as fast as the current code.
>

It seems difficult to achieve the same performance. As we discussed before,
the previous `git cat-file --batch` can only provide a few kinds of metadata
that do not need to be parsed, while using the ref-filter logic allows
cat-file to use more structured information about git objects. But this
means it takes a harder path: it requires many traversals and many copies.
If we really use the logic in ref-filter, we can only do partial
optimization; we can't expect it to be as fast as the old function. The
exception would be if we had the opportunity to not use the logic in
ref-filter but still use the new atoms in ref-filter; that may have a chance
to escape the copies in ref-filter. I don't know what your opinion is...

Thanks.
--
ZheNing Hu