On Thu, Jul 07, 2011 at 12:34:37PM -0700, Junio C Hamano wrote: > Jeff King <peff@xxxxxxxx> writes: > > > You could "cheat" and instead of storing the sha1 of a blob object in > > the notes tree, use the lower 32 bits to store an actual value. I don't > > think that currently breaks any assumptions in the notes code, but it > > definitely is against the intent of it. > > I highly suspect that it would break fsck rather badly. You may not even > be able to repack a repository with such a notes tree. True. I think you would have to do the file-mode hack that Jakub suggested. But that's getting pretty gross. If something isn't big enough to be in a blob, and especially if we are just caching, it would be nice to have some lighter-weight caching mechanism. > > For a local lookup cache, I would use a fixed-size binary integer just > > to keep the lookup data structure simple (then you know the width of > > each record ahead of time). For a generation commit header, obviously we > > would go with the ascii representation as we do for other headers. So I implemented something like this today. In fact, it's a generic[1] fast persistent object-data mapping for data of a fixed size. The on-disk representation is a stream of pairs: binary sha1s followed by their fixed-size data. Lookup is by binary search (using sha1_entry_pos, which makes this more or less the same as pack-index lookups). There's a separate in-memory lookaside table that receives updates. These are stored as a hash[2] because except for the first run, this will typically be much smaller than the disk version, and we care more about insertion speed here. When git exits, the memory and disk versions are merged into a new cache which atomically replaces the old version via rename(). Here are the timings I came up with using it on top of my depth-first contains algorithm. All runs are for "git tag --contains HEAD~1000" in the linux-2.6 repo. All times are best-of-five unless otherwise noted. To get a baseline, I measured the algorithm with no cutoff at all (i.e., ffc4b80 in pu), and then with a cutoff based on timestamp with one day of slop (i.e., de9f14e in pu): none: real 0m3.139s user 0m3.044s sys 0m0.092s timestamp: real 0m0.027s user 0m0.024s sys 0m0.000s We can use the "timestamp" value as our goal; it's fast, but not necessarily correct in the face of skew (and it's about as fast as we would expect a generation header inside the commit to perform). We can use "none" as a lower goalpost. It's correct, but slow. If we're slower than it, then we have totally failed. Then I tried doing a generation-based cutoff, caching the generations via notes-cache. Here are those timings: notes (1st run): real 0m14.153s user 0m7.868s sys 0m5.392s notes (before repack): real 0m0.102s user 0m0.076s sys 0m0.024s notes (after repack): real 0m0.090s user 0m0.072s sys 0m0.016s It's pretty painful to actually generate the cache, mostly because we end up writing a ton of tree and blob objects. The objects directory balloons from 503M to 1.1G after that run. Repacking brings that down to a mere 524M, or 21M spent on the cache. Not shown in these timings is the painfully slow "git gc" it took to get there. So there's a nice speedup over the no-cutoff case, but we're still 3 times as slow as the timestamp case. And the sheer amount of object cruft (both in terms of wasted space, and wasted time writing and repacking) is ugly. Next up is the custom object-cache code: custom (1st run): real 0m3.769s user 0m3.404s sys 0m0.360s custom: real 0m0.035s user 0m0.028s sys 0m0.004s You can see that the first run is a bit slower, as we have to touch every commit to figure out the generations. But it also highlights how much of the notes-cache version is spent not actually figuring out the generations, but rather just writing the notes tree. Subsequent runs are pretty darn fast. It's a tiny bit slower than using the timestamps, but it's within the noise. The resulting cache file is 5.9M. So it seems like a good direction to pursue. The only downside I see is that we may be slower operating in a read-only repository in which nobody has generated any cache yet. But that seems like a bit of a crazy case, and even then, it's on par with the no-cutoff-at-all case, so it's really not that bad. And it's guaranteed to be correct in the face of skew, as opposed to the fast timestamp case. -Peff [1] I intentionally wrote the object caching code in a very generic, data-agnostic way. I have a patch series to speed up git-cherry by caching patch-ids of commits against their parents. It uses notes-cache and already provides some speedup, but I'd like to see if I can make it faster with the new code. [2] Instead of writing my own hash, I hacked decorate.[ch] to have "value" semantics. I.e., you can now store values of arbitrary size. The existing semantics of storing a "void *" are easy to do on top of that. I noticed that fast-export is already encoding uint32_t's inside the pointers. This makes that a little more supported, and also means that the same hack will work for data larger than a void pointer (e.g., patch-id caching will need 20 bytes). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html