On Sun, 17 May 2009, Heikki Orsila wrote:
>
> There was an interesting discussion at
>
>    http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
>
> that involves DAGs and decompression in Git. The problem is achieving
> parallelism. The following comment was made:
>
> "And is it possible to store the block pointers from one object to
> another in uncompressed form?"

For the biggest form of this, we actually already do.

The single biggest win of compression is the delta-compression: the
regular zlib compression is generally about a factor-of-two for unpacked
and "base" delta entries, but much less for already delta-compressed
entries. In comparison, the delta-compression is likely about a factor of
10 or more.

And the delta compression already has the SHA1 pointer to the delta base
entry uncompressed, but the decompression is serialized by the fact that
we need to uncompress the base entry in order to then apply a delta on
top of it.

Now, there's no question that we could have higher levels of parallelism
by walking multiple such chains in parallel (we don't always have more
than one chain to walk, but sometimes we do). But as I point out in that
thread, we currently don't have any locking for the core object data
structures, and adding that locking would likely slow things down more
than it speeds things up for the normal case.

For 'git fsck', we could speed things up a lot by doing parallel work -
and we wouldn't need to have anything else uncompressed; we could just
take advantage of the fact that we could try to uncompress the different
delta chains in parallel. And yes, fsck is really slow, but on the other
hand, it's something that most people never do, and I do it about once a
month. The fact that it takes four minutes rather than one is not a big
deal.

There are other forms of compression where the SHA1 pointers are inside
the compressed data (the "regular" commit->{commit,tree} relationships
and tree->{anything} cases). And yes, we could probably get rid of at
least the zlib compression in some of that. Much of that data doesn't
even compress very well (SHA1s are basically incompressible), and the
compression is done largely because we have one unified interface for
everything (so the basic object code doesn't need to care about different
object types or different formats: it's all just binary data with a magic
header to it).

But in that case, we'd probably not want to keep a separate uncompressed
tree; we'd just decide that "compression is too expensive to be worth
it".

That said, on my laptops, CPU time really _never_ is the issue. Every
single time something is slow, the issue is a slow 4200rpm disk that may
get 25MB/s off it for linear things in the best case, but seeks take
milliseconds and any kind of random access will just kill performance.

So in the big picture, I suspect even the wasted CPU time is worth it. It
makes some operations slower, but anything that makes the on-disk data
denser is good. Because the case I end up caring most about is always the
uncached case.

		Linus
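
[Editorial sketch] To make the serialization point above concrete, here is a
minimal sketch in Python. It is purely illustrative: the toy object store, the
toy delta format and the apply_delta() helper are invented for the example and
are not git's actual code or formats. It shows why a single delta chain has to
be walked strictly in order, while independent chains could be inflated in
parallel - the kind of win a parallel 'git fsck' could get, if the in-core
object structures were safe to share between threads.

    # Illustrative sketch only -- not git's code or on-disk formats.
    # A toy store maps an id either to a full ("base") object or to a
    # ("delta", base_id, ops) entry.  resolve() cannot apply a delta until
    # its base has been fully reconstructed, so one chain is inherently
    # serial; unrelated chains can be resolved on separate threads.
    from concurrent.futures import ThreadPoolExecutor


    def apply_delta(base, ops):
        # Toy delta: a list of ("copy", offset, length) or ("insert", data)
        # operations, which is conceptually how git's binary deltas work too.
        out = bytearray()
        for op in ops:
            if op[0] == "copy":
                _, offset, length = op
                out += base[offset:offset + length]
            else:
                out += op[1]
        return bytes(out)


    def resolve(store, oid):
        # Reconstruct the full content of object `oid`.
        kind, *rest = store[oid]
        if kind == "base":
            return rest[0]
        base_id, ops = rest
        base = resolve(store, base_id)   # must finish before the delta applies
        return apply_delta(base, ops)


    def resolve_chains_in_parallel(store, tips):
        # Independent chains can be walked concurrently -- but note that
        # git's real in-core object structures have no locking today, which
        # is exactly the problem discussed above.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda oid: resolve(store, oid), tips))


    if __name__ == "__main__":
        store = {
            "a": ("base", b"hello world\n"),
            "b": ("delta", "a", [("copy", 0, 6), ("insert", b"git\n")]),
            "c": ("base", b"completely unrelated\n"),
        }
        print(resolve_chains_in_parallel(store, ["b", "c"]))
        # [b'hello git\n', b'completely unrelated\n']

In real packfiles, a REF_DELTA entry likewise records the 20-byte SHA1 of its
base right in the entry header, outside the zlib stream - the "pointer stored
uncompressed" case mentioned above.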
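
[Editorial sketch] And to make the "binary data with a magic header"
description concrete: a loose object on disk is simply zlib-deflated
"<type> <size>\0<payload>", so the object layer can hand back any object
without caring what kind it is. A minimal reader sketch follows (again
illustrative, not git's actual reader; the example path is made up).

    # Minimal sketch of reading a loose object (.git/objects/xx/yyyy...).
    # The on-disk format is zlib-deflated "<type> <size>\0<payload>";
    # everything type-specific is layered on top of that.
    import zlib


    def read_loose_object(path):
        with open(path, "rb") as f:
            data = zlib.decompress(f.read())
        header, _, payload = data.partition(b"\0")   # "<type> <size>", then NUL
        obj_type, size = header.split(b" ")
        assert int(size) == len(payload)
        return obj_type.decode("ascii"), payload


    # e.g. read_loose_object(".git/objects/4b/825d...") might return
    # ("tree", b"...") or ("commit", b"tree <sha>\nparent <sha>\n...")

The commit->tree and commit->parent pointers discussed above live as text
inside that compressed payload - exactly the "SHA1 pointers inside the
compressed data" case.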