Re: GSoC Git Proposal Draft - ZheNing Hu

Jeff King <peff@xxxxxxxx> · Tue, 13 Apr 2021 02:40:41 -0400

On Sun, Apr 11, 2021 at 11:34:35PM +0800, ZheNing Hu wrote:

> > Why is Olga’s solution rejected?
> > 1. Olga's solution is to let `git cat-file` use the `ref-filter` interface,
> > but the performance of `cat-file` appears to degrade because `ref-filter` is
> > "very eager to allocate lots of separate strings", among other reasons.
> 
> I was wondering today whether we can append some object information
> directly to `&state->stack->output`, instead of assigning it to `v->s` first.

Yes, that's the direction I think we'd want to go.
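
As a rough sketch of that direction, an atom handler could format straight
into the output buffer rather than into an intermediate per-atom string. The
`outbuf` type and both function names below are made up for illustration;
real code would presumably use Git's strbuf, and nothing here is taken from
ref-filter itself:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy growable buffer; a hypothetical stand-in, not Git's strbuf. */
struct outbuf {
	char *data;
	size_t len, alloc;
};

/* Append formatted text straight onto the end of the output buffer. */
static void outbuf_addf(struct outbuf *out, const char *fmt, ...)
{
	va_list ap;
	int need;

	/* First pass: measure how many bytes the formatted text needs. */
	va_start(ap, fmt);
	need = vsnprintf(NULL, 0, fmt, ap);
	va_end(ap);
	assert(need >= 0);

	/* Grow the buffer only when the new text does not fit. */
	if (out->len + need + 1 > out->alloc) {
		out->alloc = (out->len + need + 1) * 2;
		out->data = realloc(out->data, out->alloc);
		if (!out->data)
			abort();
	}

	/* Second pass: write directly after the existing contents. */
	va_start(ap, fmt);
	vsnprintf(out->data + out->len, need + 1, fmt, ap);
	va_end(ap);
	out->len += need;
}

/* Hypothetical atom handler: no separate per-atom string is allocated. */
static void append_objectsize(struct outbuf *out, unsigned long size)
{
	outbuf_addf(out, "%lu", size);
}
```

Calling `append_objectsize(&out, 1234)` followed by
`outbuf_addf(&out, " %s", "blob")` leaves `1234 blob` in the buffer, with at
most one allocation for the whole line.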

> But `cmp_ref_sorting()` uses `get_ref_atom_value()`, so it may compare the
> `v->s` of two different refs. In that case I still have to fill the object
> info into `v->s`.
> 
> So I think this is one of the reasons why `ref-filter` desires to allocate a
> large number of strings, right?

Yeah, I think sorting in general is a bit tricky, because it inherently
requires collecting the value for each item. Just thinking about what
properties an ideal solution would have (which we might not be able to
get all of):

  - if we're sorting by something numeric (e.g., a committer
    timestamp), we should avoid forming it into a string at all

  - if the sort item requires work to extract that overlaps with the
    output format (e.g., sorting by authordate and showing author name
    in the format, both of which require parsing the author ident line
    of a commit), ideally we'd just do that work once per ref/object.

  - if we are sorting, obviously we have to hold some amount of data for
    each item in memory all at once (since we have to get data on the
    sort properties for each, and then sort the result). So we'd
    probably need at least some allocation per ref anyway, and an extra
    string isn't too bad. But if we're not sorting, then it would be
    nice to consider one ref/object at a time, which lets us keep our
    peak memory usage lower, reuse output buffers, etc.
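
To illustrate the first and third points, here is a minimal sketch, not
ref-filter's actual code. `struct ref_item` and the function names are
hypothetical. The sorting path keeps only a numeric key per ref and compares
it as an integer; the non-sorting path handles one ref at a time and reuses a
single line buffer:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-ref record; only the sort key is held per item. */
struct ref_item {
	const char *refname;
	long long committer_time; /* numeric key, never turned into a string */
};

/* Compare timestamps as integers; avoids overflow from subtraction. */
static int cmp_by_time(const void *a, const void *b)
{
	const struct ref_item *x = a, *y = b;

	return (x->committer_time > y->committer_time) -
	       (x->committer_time < y->committer_time);
}

/* Sorting path: every key has to be in memory before we can sort. */
static void sort_refs_by_time(struct ref_item *items, size_t nr)
{
	assert(items || !nr);
	qsort(items, nr, sizeof(*items), cmp_by_time);
}

/*
 * Non-sorting path: format one ref at a time into a buffer that is
 * reused for every line, so peak memory does not grow with the number
 * of refs. Returns the number of lines written.
 */
static size_t show_refs_unsorted(const struct ref_item *items, size_t nr,
				 FILE *fp)
{
	char line[256];
	size_t i, written = 0;

	assert(items || !nr);
	for (i = 0; i < nr; i++) {
		int n = snprintf(line, sizeof(line), "%lld %s\n",
				 items[i].committer_time, items[i].refname);

		/* Skip lines that would not fit in the fixed buffer. */
		if (n < 0 || (size_t)n >= sizeof(line))
			continue;
		fwrite(line, 1, (size_t)n, fp);
		written++;
	}
	return written;
}
```

The fixed-size `line` buffer is only there to keep the sketch short; a
growable buffer that is reset between refs would serve the same purpose.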

I think some of those are in competition with each other. Sharing
work between the sorting and format steps means keeping more data
in memory. So it might be sensible to just treat them totally
independently, and not worry about sharing work (I haven't looked at how
ref-filter does this now).  TBH, I care a lot less about making the
"sorting" case fast than I do about making sure that if we _aren't_
sorting, we go as fast as possible.

-Peff