Re: [PATCH 00/28] Store references hierarchically in cache

Ramkumar Ramachandra <artagnon@xxxxxxxxx> · Fri, 28 Oct 2011 18:37:13 +0530

Hi Michael,

Michael Haggerty writes:
> Therefore, this patch series changes the data structure used to store
> the reference cache from a single array of pointers-to-struct into a
> tree-like structure in which each subdirectory of the reference
> namespace is stored as an array of pointers-to-entry and entries can
> be either references or subdirectories containing more references.

Very nice! I like the idea. Can't wait to start reading the series.
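
To check my understanding of the description before diving in, here is a
toy model in Python. It is not the C from the series: RefEntry, add_ref,
find_child and lookup are names I made up, and the sorted-list-per-directory
layout is only my reading of "array of pointers-to-entry".

```python
import bisect


class RefEntry:
    """One path component: a reference (sha1 set) or a directory."""

    def __init__(self, name, sha1=None):
        self.name = name
        self.sha1 = sha1                               # None for a directory
        self.children = [] if sha1 is None else None   # kept sorted by name


def _index(dir_entry, name):
    # Toy shortcut: rebuilds the key list on every call. Real code would
    # binary-search the array of entries directly.
    names = [c.name for c in dir_entry.children]
    return bisect.bisect_left(names, name)


def find_child(dir_entry, name):
    i = _index(dir_entry, name)
    if i < len(dir_entry.children) and dir_entry.children[i].name == name:
        return dir_entry.children[i]
    return None


def add_ref(root, refname, sha1):
    *dirs, leaf = refname.split("/")
    d = root
    for part in dirs:
        child = find_child(d, part)
        if child is None:
            child = RefEntry(part)                     # new subdirectory
            d.children.insert(_index(d, part), child)
        elif child.children is None:
            raise ValueError("a reference is in the way: " + part)
        d = child
    if find_child(d, leaf) is not None:
        raise ValueError("name already in use: " + refname)
    d.children.insert(_index(d, leaf), RefEntry(leaf, sha1))


def lookup(root, refname):
    """Return the sha1 for refname, or None if it is absent or a directory."""
    entry = root
    for part in refname.split("/"):
        if entry.children is None:
            return None
        entry = find_child(entry, part)
        if entry is None:
            return None
    return entry.sha1
```

The point of the sketch is that a lookup only touches the directories
along one path, never the whole namespace.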

>  * refs/replace is almost *always* needed even though it often
>    doesn't even exist.  Thus the presence of many loose references
>    slows down *many* git commands for no reason whatsoever.

Was this one of your primary inspirations for writing this series?

>  * When a new reference is created, is_refname_available() is called
>    to see whether there is another reference whose name
>    conflicts with the new one.  Currently this loads and iterates
>    through *all* references.  But there are only a few refnames that
>    can possibly conflict; for example, given the refname
>    "refs/heads/foo/bar", the only possible conflicts are with
>    "refs/heads/foo" and "refs/heads/foo/bar/*".  Therefore it is
>    silly to load and iterate through the whole refname hierarchy.

Hm, the original design does sound quite sub-optimal.  I suppose it
was written when Git didn't have so many refs in so many
subdirectories.
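
With a tree, I imagine the check only has to walk one path. A rough Python
sketch of what I mean, using nested dicts as a stand-in for the cache (a
dict is a directory, a string is a reference). This is not the actual
is_refname_available() from refs.c; treating an exact match as
"unavailable" and assuming directories are never empty are my own
simplifications.

```python
def is_refname_available(root, refname):
    """Return True if refname conflicts with nothing in the tree."""
    node = root
    for part in refname.split("/"):
        if not isinstance(node, dict):
            # A proper prefix of refname (e.g. "refs/heads/foo") is a ref.
            return False
        if part not in node:
            # Nothing exists at or below this point, so nothing can clash.
            return True
        node = node[part]
    # refname itself exists: either as a reference, or as a directory
    # holding "refname/*" references.
    return False
```

Only the entries on the path from the root to the new name are visited,
which matches the observation quoted above that very few refnames can
possibly conflict.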

>  * "git for-each-ref" is capable of searching a subtree of the
>    references.  But currently this causes all references to be
>    loaded.

Ah.  I was using git for-each-ref to write a filter-branch-like tool
earlier, and I was wondering why it was so slow.
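
I assume the subtree case then becomes "descend to the directory, iterate
only below it". Again a toy Python sketch over nested dicts (dict =
directory, string = reference), with made-up function names. It ignores
the glob patterns that the real git for-each-ref accepts and makes no
claim about matching git's output order.

```python
def iter_refs(node, prefix=""):
    """Yield (refname, sha1) for every reference below a directory."""
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            yield from iter_refs(child, prefix + name + "/")
        else:
            yield prefix + name, child


def for_each_ref_in(root, subtree):
    """Yield only the references under subtree, e.g. "refs/tags"."""
    subtree = subtree.strip("/")
    node = root
    for part in subtree.split("/"):
        if not isinstance(node, dict) or part not in node:
            return                      # no such subtree: yield nothing
        node = node[part]
    if isinstance(node, dict):
        yield from iter_refs(node, subtree + "/")
```

Sibling directories such as refs/heads are never entered when only
refs/tags is requested.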

> * the time to create a new branch goes from 180 ms to less than 10 ms
>  (my test resolution only includes two decimal places) and the time
>  to checkout a new branch does the same.

I'm interested in seeing how the callgraph changed.  Assuming you used
Valgrind to profile it, could you publish the outputs?

> * the time for a "git filter-branch" of all commits (which used to
>  scale like N^2) goes from 4 hours to 13 minutes.  (Since
>  filter-branch necessarily *creates* lots of loose references, the
>  savings are also there if the references are originally packed.)

This is seriously awesome.

> The efficiency gains are such that some operations are now faster with
> loose references than with packed references; however, some operations
> with packed references slow down a bit.

Out of curiosity, why do operations with packed references slow down?
(I'll probably find out in a few minutes after reading the series, but
I'm asking anyway because it's very non-obvious to me now.)

> These changes do not increase the amount of space per reference needed
> for the reference cache, but they do add one similarly-sized entry for
> each subdirectory (for each of loose and packed).  I don't think that
> the space increase should be significant in any reasonable situation.
>
> After these changes, there is a benefit to sharding the reference
> namespace, especially for loose references.

Hm, I wonder what this means for Git hosting services.

> Patches 11-24 change most of the internal functions to work with
> "struct ref_entry *" (namely the kind of ref_entry that holds a
> directory of references) instead of "struct ref_dir *".  The reason
> for this change is to allow these functions access to the "flag" and
> "name" fields that are stored in ref_entry and thereby avoid having to
> store redundant information in "struct ref_dir" (which would increase
> the size of *every* ref_entry because of its presence in the union).

Hm, I was wondering why the series was looking so intimidating.  Is it
not possible to squash all (or at least some) of these together?

> From: Michael Haggerty <mhagger@xxxxxxxxxxxx>

Nit: Can't you configure your email client to put this in the "From: "
header of your emails?

Thanks for the interesting read.

-- Ram