Re: RFC: New reference iteration paradigm

Jakub Narebski <jnareb@xxxxxxxxx> · Sat, 26 May 2018 19:25:32 +0200

Jeff King <peff@xxxxxxxx> writes:
> On Thu, Mar 31, 2016 at 11:01:44AM -0700, Junio C Hamano wrote:
>> Michael Haggerty <mhagger@xxxxxxxxxxxx> writes:
>> 
>>> the backend now has to implement
>>>
>>>> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
>>>>                                            const char *prefix,
>>>>                                            unsigned int flags);

The problem with callbacks, including their lack of composability, is
often solved in high-level languages by using Promises / Futures (or
Channels / Supplies).

But I think in this case implementing a straightforward Iterator pattern
would be a better and simpler solution, especially in C.

>>> The ref_iterator itself has to implement two main methods:
>>>
>>>> int iterator_advance_fn(struct ref_iterator *ref_iterator);
>>>> void iterator_free_fn(struct ref_iterator *ref_iterator);
>>>
>>> A loop over references now looks something like
>>>
>>>> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
>>>> while (ref_iterator_advance(iter)) {
>>>>         /* code using iter->refname, iter->oid, iter->flags */
>>>> }
>> 
>> We'd want to take advantage of the tree-like organization of the
>> refs (i.e. refs/tags/a and refs/tags/b sit next to each other and
>> they are closer to each other than they are to refs/heads/a) so that
>> a request "I want to iterate only over tags, even though I may have
>> millions of other kinds of refs" can be done with cost that is
>> proportional to how many tags you have.
>> 
>> The current implementation of for_each_tag_ref() that goes down to
>> do_for_each_entry() in files-backend.c has that propertly, and the
>> new iteration mechanism with the above design seems to keep it,
>> which is very nice.
>
> Actually, that is a slight fiction. :)
>
> We traverse only the loose ref directories we need, but we populate the
> entire packed-refs tree in one go. That's fine if you have a reasonable
> number of refs and care about the cold-cache case (where you care a lot
> about hitting directories you don't need to, but reading the entire
> packed-refs file isn't a big deal). But it's really bad if you have an
> 800MB packed-refs file, as looking up one tiny subset of the entries
> wastes a lot of RAM and CPU pulling that into our internal
> representation[1].
>
> At one point I wrote a patch to binary search the packed-refs file, find
> the first "refs/tags/" entry, and then walk linearly through there. What
> stopped me is that the current refs.c code (I guess file-backend.c these
> days) was not happy with me partially filling in the ref_dir structs in
> this "inside out" way.
[...]

Isn't this what reftable - an alternative way of storing refs in Git,
tested by being used by JGit - would solve?  See Christian Couder post
"Implementing reftable in Git"

  https://public-inbox.org/git/CAP8UFD0PPZSjBnxCA7ez91vBuatcHKQ+JUWvTD1iHcXzPBjPBg@xxxxxxxxxxxxxx/t/#u

'Efficient lookup of an entire namespace, such as refs/tags/' is
allegedly one of the objectives of this format.

Regards,
-- 
Jakub Narębski