Nicolas Pitre <nico@xxxxxxx> wrote: > On Fri, 16 Mar 2007, Linus Torvalds wrote: > > I also didn't worry about it, because I felt that if it became a problem, > > it would be easy to just add a cache of base objects (we probably do *not* > > want to keep the whole unpacked object info in memory all the time just > > because of memory pressure issues, so "cache of base objects" is better). > > However, the "pack file + offset" thing makes it harder to do, since we > > now don't even have the SHA1 of the base object before we unpack it. > > > > But I guess we could just index this by a <packfile, offset> tuple. ... > Then it would only be a matter of coming up with a clever cache > eviction algorithm. Yes. Linus above seems to imply (at least to me) that we wouldn't want to cache the original object requested by read_sha1_file(), as its not the delta base. But given our packing rules, we should be (in general anyway) first asking for the most recent revision of a file, which is stored whole, then for an older revision, which will be a delta of the more recent revision we just saw. Hence we probably would want to cache an object. Well, at least anything that had been packed as a delta. Caching a deflated OBJ_BLOB may not be worth it. > > Anyway, I bet that this is a much bigger issue than the pack format > > itself (and is largely independent). > > Well, I think the pack format issue is significant too. But because > those are independent issues the gain in performance will be additive. I'm torn there. There's two places that we do lots of unpacks of objects where we run into this difficult case of unpacking the same base object many times: git-blame and a rev-list with a path limiter. Now the git-blame case is obvious: we are constantly unpacking various revisions of the same file, and these are probably delta'd against each other, so the unpacking gets really brutal after a while. A blob cache here would probably *really* help out git-blame. What's slightly less obvious about git-blame is we are probably also traversing the different versions of the same trees over and over, as we resolve the path to the correct blob in each commit we traverse. So again here we are hitting lots of the same trees multiple times. That last part about git-blame also obviously applies to the rev-list with a path limiter. But most other operations don't seem like they would benefit from a base object cache; actually they might slow down from having such a cache present! Commits tend not to delta well; if they delta it is a very rare occurrance. So we aren't getting huge unpacking benefits there by caching them. Scratch any benefit of the cache for any sort of rev-list operation that doesn't require tree access. As for the other common operations (diff, read-tree, checkout-index, merge-recursive): I don't think these will benefit from a cache either. Their data access patterns are pretty spread out over the tree. With the exception of rename detection we hit everything only once. After touching a path, we tend to not go back to it. So unless we are really lucky and one blob acts as a base object for many others at different paths (possible, but I suspect not very likely) its not worth caching the base. If we do hit something twice, its probably because we are doing two distinct passes over the data. In this case the passes are probably because we either don't want to hold all of the data in memory (too big of a set for some projects) or because we tried one algorithm, failed, and are now trying a different one (internal read-tree in merge-recursive). Caching in merge-recursive may help, but just making the dirty cache (index) that resulted from the internal read-tree available for the remainder of the merge-recursive process might be faster; especially if we only have one base and don't need to recursively merge multiple bases. So where does that leave us? The only places I see a base object cache really helping is in git-blame for blob access, repeated tree access (git-blame and path limiting), and maybe we could do better with the common cases in merge-recursive by being smarter with the cache. But with pack v4 I don't think I need a tree object cache. With a 6 byte fixed record format, a strict ordering requirement, a finite delta depth within a packfile, a stricter tree-specific delta encoder, and a minor API change to tree-walk.h, I think we can unpack the delta at the same time that we are walking the tree. No upfront unpack required. Hence no reason to cache. So yea, a base object cache may help us today. It will most definately help in git-blame. But I doubt it will help with trees in pack v4, and I think it will just hurt in most cases. So maybe it should be local to git-blame only. -- Shawn. - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html