Re: RFC: New reference iteration paradigm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 31, 2016 at 11:01:44AM -0700, Junio C Hamano wrote:

> Michael Haggerty <mhagger@xxxxxxxxxxxx> writes:
> 
> > the backend now has to implement
> >
> >> struct ref_iterator *ref_iterator_begin_fn(const char *submodule,
> >>                                            const char *prefix,
> >>                                            unsigned int flags);
> >
> > The ref_iterator itself has to implement two main methods:
> >
> >> int iterator_advance_fn(struct ref_iterator *ref_iterator);
> >> void iterator_free_fn(struct ref_iterator *ref_iterator);
> >
> > A loop over references now looks something like
> >
> >> struct ref_iterator *iter = each_ref_in_iterator("refs/tags/");
> >> while (ref_iterator_advance(iter)) {
> >>         /* code using iter->refname, iter->oid, iter->flags */
> >> }
> 
> We'd want to take advantage of the tree-like organization of the
> refs (i.e. refs/tags/a and refs/tags/b sit next to each other and
> they are closer to each other than they are to refs/heads/a) so that
> a request "I want to iterate only over tags, even though I may have
> millions of other kinds of refs" can be done with cost that is
> proportional to how many tags you have.
> 
> The current implementation of for_each_tag_ref() that goes down to
> do_for_each_entry() in files-backend.c has that propertly, and the
> new iteration mechanism with the above design seems to keep it,
> which is very nice.

Actually, that is a slight fiction. :)

We traverse only the loose ref directories we need, but we populate the
entire packed-refs tree in one go. That's fine if you have a reasonable
number of refs and care about the cold-cache case (where you care a lot
about hitting directories you don't need to, but reading the entire
packed-refs file isn't a big deal). But it's really bad if you have an
800MB packed-refs file, as looking up one tiny subset of the entries
wastes a lot of RAM and CPU pulling that into our internal
representation[1].

At one point I wrote a patch to binary search the packed-refs file, find
the first "refs/tags/" entry, and then walk linearly through there. What
stopped me is that the current refs.c code (I guess file-backend.c these
days) was not happy with me partially filling in the ref_dir structs in
this "inside out" way.

That is, they assume we may have iterated "refs/", but not yet dove into
"refs/tags/" (because that's what makes sense for traversing loose
refs). But they are not happy if you already dove into "refs/tags/", but
don't know what else might be present in "refs/".

I wonder if, while Michael is fiddling with the iterator code, it might
be possible to fix that (it's not a fundamental problem, just one with
the way the ref-caching works right now).

-Peff

[1] As you probably guessed, I run into these giant packed-refs files as
    part of our "alternates" storage for multiple forks of the same
    repository. So all of the refs match the pattern
    "refs/remotes/$fork_id/*", and I really would like to be able to
    iterate the refs for just one fork at a time.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux