Hi Michael,

Michael Haggerty writes:
> Therefore, this patch series changes the data structure used to store
> the reference cache from a single array of pointers-to-struct into a
> tree-like structure in which each subdirectory of the reference
> namespace is stored as an array of pointers-to-entry and entries can
> be either references or subdirectories containing more references.

Very nice! I like the idea. Can't wait to start reading the series.

> * refs/replace is almost *always* needed even though it often
>   doesn't even exist. Thus the presence of many loose references
>   slows down *many* git commands for no reason whatsoever.

Was this one of your primary inspirations for writing this series?

> * When a new reference is created, is_refname_available() is called
>   to see whether there is another reference whose name
>   conflicts with the new one. Currently this loads and iterates
>   through *all* references. But there are only a few refnames that
>   can possibly conflict; for example, given the refname
>   "refs/heads/foo/bar", the only possible conflicts are with
>   "refs/heads/foo" and "refs/heads/foo/bar/*". Therefore it is
>   silly to load and iterate through the whole refname hierarchy.

Hm, the original design does sound quite sub-optimal. I suppose it
was written when Git didn't have so many refs in so many
subdirectories.

> * "git for-each-ref" is capable of searching a subtree of the
>   references. But currently this causes all references to be
>   loaded.

Ah. I was using git for-each-ref to write a filter-branch-like thing
earlier, and I was wondering why it was so slow.

> * the time to create a new branch goes from 180 ms to less than 10 ms
>   (my test resolution only includes two decimal places) and the time
>   to checkout a new branch does the same.

I'm interested in seeing how the callgraph changed. Assuming you
used Valgrind to profile it, could you publish the outputs?

> * the time for a "git filter-branch" of all commits (which used to
>   scale like N^2) goes from 4 hours to 13 minutes. (Since
>   filter-branch necessarily *creates* lots of loose references, the
>   savings are also there if the references are originally packed.)

This is seriously awesome.

> The efficiency gains are such that some operations are now faster with
> loose references than with packed references; however, some operations
> with packed references slow down a bit.

Out of curiosity, why do operations with packed references slow down?
(I'll probably find out in a few minutes after reading the series, but
I'm asking anyway because it's very non-obvious to me now.)

> These changes do not increase the amount of space per reference needed
> for the reference cache, but they do add one similarly-sized entry for
> each subdirectory (for each of loose and packed). I don't think that
> the space increase should be significant in any reasonable situation.
>
> After these changes, there is a benefit to sharding the reference
> namespace, especially for loose references.

Hm, I wonder what this means for Git hosting services.

> Patches 11-24 change most of the internal functions to work with
> "struct ref_entry *" (namely the kind of ref_entry that holds a
> directory of references) instead of "struct ref_dir *". The reason
> for this change is to allow these functions access to the "flag" and
> "name" fields that are stored in ref_entry and thereby avoid having to
> store redundant information in "struct ref_dir" (which would increase
> the size of *every* ref_entry because of its presence in the union).

Hm, I was wondering why the series was looking so intimidating. Is it
not possible to squash all (or at least some) of these together?

> From: Michael Haggerty <mhagger@xxxxxxxxxxxx>

Nit: Can't you configure your email client to put this in the "From: "
header of your emails?

Thanks for the interesting read.
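
P.S. To make the is_refname_available() point quoted earlier concrete,
here is a minimal sketch of the conflict rule as I understand it from
the cover letter: two refnames conflict when one of them would have to
be a directory containing the other. The helper names is_dir_prefix()
and refnames_conflict() are made up for illustration and are not taken
from refs.c or from the series.

```c
#include <string.h>

/*
 * Hypothetical helper (not Git's actual code): return 1 if `dir` would
 * have to be a directory containing `name`, i.e. `name` starts with
 * `dir` immediately followed by a '/'.
 */
static int is_dir_prefix(const char *dir, const char *name)
{
	size_t len = strlen(dir);

	/*
	 * If `name` is shorter than `dir`, strncmp() already reports a
	 * mismatch at name's terminating NUL, so name[len] is only read
	 * when it is a valid index.
	 */
	return strncmp(dir, name, len) == 0 && name[len] == '/';
}

/*
 * Assumed conflict rule: "refs/heads/foo/bar" conflicts with
 * "refs/heads/foo" (a file where a directory is needed) and with
 * anything under "refs/heads/foo/bar/" (a directory where a file is
 * needed). Identical names are not treated as a conflict here.
 */
static int refnames_conflict(const char *new_ref, const char *old_ref)
{
	return is_dir_prefix(new_ref, old_ref) ||
	       is_dir_prefix(old_ref, new_ref);
}
```

With a flat array, a check like this has to be run against every
cached reference. With a tree keyed by subdirectory, only the entries
along the path to the new refname (and anything below it) would need
to be looked at, which is presumably where the savings come from.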

-- Ram