On Tue, Apr 3, 2012 at 10:49 AM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
>> Has anyone looked seriously at a new index format that stores the
>> redundant information in a more easily accessible way? It would increase
>> our disk usage, but for something like linux-2.6, only by 10MB per
>> 32-bit word. On most of my systems I would gladly spare some extra RAM
>> for the disk cache if it meant I could avoid inflating a bunch of
>> objects. And this could easily be made optional for systems that don't
>> want to make the tradeoff (if it's not there, you fall back to the
>> current procedure; we could even store the data in a separate file to
>> retain indexv2 compatibility).
>>
>> So it's sort-of a cache, in that it's redundant with the actual data.
>> But staleness and writing issues are a lot simpler, since it only gets
>> updated when we index the pack (and the pack index in general is a
>> similar concept; we are "caching" the location of the object in the
>> packfile, rather than doing a linear search to look it up each time).
>
> I think I have something like that, (generate a machine-friendly
> commit cache per pack, staying in $GIT_DIR/objects/pack/ too). It's
> separate cache staying in $GIT_DIR/objects/pack, just like pack-.idx
> files. It does improve rev-list time, but I'd rather wait for packv4,
> or at least be sure that packv4 will not come anytime soon, before
> pushing the cache route.

When I looked at a commit cache for rev-list, I tried to cache trees
too, but the resulting cache was too big. I have since managed to shrink
the tree cache and measured the performance gain. Sorry, no patch here
because the code is ugly, just numbers, but you can look at the cache
generation code at [1].

On linux-2.6.git, with a single 521MB pack, it generates a 356MB cache
and a 30MB companion index. If you are willing to pay an extra 5 seconds
for decompression, the cache can go down to 94MB.
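
To put those sizes in proportion to the pack, here is a small Python
sketch (it is not part of the cache generation code at [1]). It assumes
the 30MB companion index stays the same size when the cache is
compressed, which the numbers above do not confirm.

```python
# Sizes in MB as reported in this message for linux-2.6.git.
PACK_MB = 521
CACHE_MB = 356             # uncompressed tree cache
CACHE_COMPRESSED_MB = 94   # compressed cache (about 5 extra seconds to use)
CACHE_INDEX_MB = 30        # companion index; ASSUMED unchanged by compression


def overhead_percent(extra_mb, pack_mb=PACK_MB):
    """Return extra disk usage as a percentage of the pack size."""
    return 100.0 * extra_mb / pack_mb


uncompressed = overhead_percent(CACHE_MB + CACHE_INDEX_MB)
compressed = overhead_percent(CACHE_COMPRESSED_MB + CACHE_INDEX_MB)

print(f"uncompressed cache + index: {uncompressed:.0f}% of the pack")
print(f"compressed cache + index:   {compressed:.0f}% of the pack")
```

Under that assumption the uncompressed variant adds roughly three
quarters of the pack size on disk, and the compressed one roughly a
quarter.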

We can cut "rev-list --objects --all" time by more than half with this
cache (uncompressed cache):

    $ time ~/w/git/git rev-list --objects --all --quiet </dev/null
    real    2m31.310s
    user    2m28.735s
    sys     0m1.604s

    $ time TREE_CACHE=cache ~/w/git/git rev-list --objects --all --quiet </dev/null
    real    1m6.810s
    user    1m6.091s
    sys     0m0.708s

    $ time ~/w/git/git rev-list --all --quiet </dev/null
    real    0m14.261s  # should be cut down to one third with commit cache
    user    0m14.088s
    sys     0m0.171s

Not really good. It would be nicer if "rev-list --objects" took less
than 30s. With the cache on, lookup_object() is at the top of the 'perf'
report. Not sure what to do about it.

[1] https://gist.github.com/2310819
--
Duy