On Thu, Apr 6, 2017 at 1:37 PM, <git@xxxxxxxxxxxxxxxxx> wrote:
> From: Jeff Hostetler <jeffhost@xxxxxxxxxxxxx>
>
> Teach traverse_trees_recursive() to not do redundant ODB
> lookups when both directories refer to the same OID.

And the reason for this is that omitting the second lookup
saves time, i.e. a lookup in the ODB of a sufficiently large
repo is slow.

My knee-jerk line of thinking:
* Yes, it sounds good to reduce the number of ODB accesses.
* But if we consider ODB lookups to be slow and we perform
  a structured access, how about a cache in front of the ODB?
* We already have that! (sort of..) 9a414486d9 (lookup_object:
  prioritize recently found objects, 2013-05-01)
* Instead of improving the caching, maybe change the
  size of the problem: We could keep the objects of different
  types in different hash tables.

object.c has its own hash table, I presume for historical and
performance reasons; this would be split up into multiple
hash tables.

In addition to "object *lookup_object(*sha1)", we'd have
a function

"object *lookup_object(*sha1, enum object_type hint)"
which looks into the correct hash table.

If you were to call just lookup_object with no hint, then you'd
look into all the different tables (I guess there is a preferable
order in which to look; I haven't thought about that).

>
> In operations such as read-tree, checkout, and merge when
> the differences between the commits are relatively small,
> there will likely be many directories that have the same
> SHA-1.  In these cases we can avoid hitting the ODB multiple
> times for the same SHA-1.

This would partially explain why there was such a good
performance boost in the referenced commit above, as we
implicitly look up the same object multiple times.

Peff is really into getting this part faster, cf.
https://public-inbox.org/git/20160914235547.h3n2otje2hec6u7k@xxxxxxxxxxxxxxxxxxxxx/

> TODO This change is a first attempt to test that by comparing
> TODO the hashes of name[i] and name[i-i] and simply copying
> TODO the tree-descriptor data.  I was thinking of the n=2
> TODO case here.  We may want to extend this to the n=3 case.
>
> ================
> On the Windows repo (500K trees, 3.1M files, 450MB index),
> this reduced the overall time by 0.75 seconds when cycling
> between 2 commits with a single file difference.
>
> (avg) before: 22.699
> (avg) after:  21.955
> ===============

So it shaves off about 3% of the time needed. It doesn't sound
like a breakthrough, but I guess these small patches add up. :)

>         for (i = 0; i < n; i++, dirmask >>= 1) {
> -               const unsigned char *sha1 = NULL;
> -               if (dirmask & 1)
> -                       sha1 = names[i].oid->hash;
> -               buf[i] = fill_tree_descriptor(t+i, sha1);
> +               if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
> +                       !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {

Why do we need to check for dirmask & 1 here?
This ought to be covered by the hashcmp already IIUC.
So maybe we can pull out the
        if (dirmask & 1)
                sha1 = names[i].oid->hash;
out of the else when dropping that dirmask check?

Thanks,
Stefan
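
For illustration only, here is a minimal standalone sketch of the
"reuse the previous descriptor when the hash matches" idea discussed
above. It is not the patch itself: struct entry, struct desc, load(),
fill_all() and the lookups counter are hypothetical stand-ins for
git's real structures and for fill_tree_descriptor(), and the "ODB
read" is simulated. The sketch keeps the dirmask test and, as an extra
assumption, only copies when the previous slot was actually filled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_LEN 20  /* SHA-1 size in bytes */

/* Hypothetical stand-ins, not git's real structures. */
struct entry { const unsigned char *hash; };   /* NULL if absent */
struct desc  { const void *buffer; size_t size; };

static int lookups;  /* number of simulated ODB reads */

/* Simulated ODB read: a NULL hash yields an empty descriptor. */
static void load(struct desc *d, const unsigned char *hash)
{
	d->buffer = NULL;
	d->size = 0;
	if (!hash)
		return;
	lookups++;
	d->buffer = hash;  /* pretend the object data is the hash itself */
	d->size = HASH_LEN;
}

/* Fill t[0..n-1]; bit i of dirmask says whether names[i] is a directory. */
static void fill_all(struct desc *t, const struct entry *names,
		     int n, unsigned long dirmask)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++, dirmask >>= 1) {
		const unsigned char *hash = (dirmask & 1) ? names[i].hash : NULL;

		if (i > 0 && hash && t[i - 1].buffer && names[i - 1].hash &&
		    !memcmp(hash, names[i - 1].hash, HASH_LEN))
			t[i] = t[i - 1];  /* same object: copy, no lookup */
		else
			load(&t[i], hash);
	}
}
```

With three entries whose hashes are A, A, B and all three marked as
directories, this performs two simulated lookups instead of three; the
second descriptor is a plain copy of the first.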