On Sun, 17 May 2009, Heikki Orsila wrote:
>
> There was an interesting discussion at
>
>    http://realworldtech.com/forums/index.cfm?action=detail&id=98909&threadid=98430&roomid=2
>
> that involves DAGs and decompression in Git. The problem is achieving
> parallelism. The following comment was made:
>
> "And is it possible to store the block pointers from one object to
> another in uncompressed form?"

For the biggest form of this, we actually already do.

The single biggest win of compression is the delta-compression: the
regular zlib compression is generally about a factor-of-two for unpacked
and "base" delta entries, but much less for already delta-compressed
entries. In comparison, the delta-compression is likely about a factor of
10 or more.

And the delta compression already has the SHA1 pointer to the delta base
entry uncompressed, but the decompression is serialized by the fact that
we need to uncompress the base entry in order to then apply a delta on
top of it.

Now, there's no question that we could have higher levels of parallelism
by walking multiple such chains in parallel (we don't always have more
than one chain to walk, but sometimes we do). But as I point out in that
thread, we currently don't have any locking for the core object data
structures, and adding that locking would likely slow things down more
than it speeds things up for the normal case.

For 'git fsck', we could speed things up a lot by doing parallel work -
and we wouldn't need to have anything else uncompressed; we could just
take advantage of the fact that we could try to uncompress the different
delta chains in parallel. And yes, fsck is really slow, but on the other
hand, it's something that most people never do, and I do it about once a
month. The fact that it takes four minutes rather than one is not a big
deal.

There are other forms of compression where the SHA1 pointers are inside
the compressed data (the "regular" commit->{commit,tree} relationships
and tree->{anything} cases). And yes, we could probably get rid of at
least the zlib compression in some of that. Much of that data doesn't
even compress very well (SHA1s are basically incompressible), and the
compression is done largely because we have one unified interface for
everything (so the basic object code doesn't need to care about different
object types or different formats: it's all just binary data with a magic
header to it).

But in that case, we'd probably not want to keep a separate uncompressed
tree; we'd just decide that "compression is too expensive to be worth
it".

That said, on my laptops, CPU time really _never_ is the issue. Every
single time something is slow, the issue is a slow 4200rpm disk that may
get 25MB/s off it for linear things in the best case, but seeks take
milliseconds and any kind of random access will just kill performance.

So in the big picture, I suspect even the wasted CPU time is worth it. It
makes some operations slower, but anything that makes the on-disk data
denser is good. Because the case I end up caring most about is always the
uncached case.

		Linus
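
[Editorial sketch] To make the serialization point above concrete, here is a
minimal sketch in Python. It is purely illustrative: the toy object store, the
toy delta format and the apply_delta() helper are invented for the example and
are not git's actual code or formats. It shows why a single delta chain has to
be walked strictly in order, while independent chains could be inflated in
parallel - the kind of win a parallel 'git fsck' could get, if the in-core
object structures were safe to share between threads.

    # Illustrative sketch only -- not git's code or on-disk formats.
    # A toy store maps an id either to a full ("base") object or to a
    # ("delta", base_id, ops) entry.  resolve() cannot apply a delta until
    # its base has been fully reconstructed, so one chain is inherently
    # serial; unrelated chains can be resolved on separate threads.
    from concurrent.futures import ThreadPoolExecutor


    def apply_delta(base, ops):
        # Toy delta: a list of ("copy", offset, length) or ("insert", data)
        # operations, which is conceptually how git's binary deltas work too.
        out = bytearray()
        for op in ops:
            if op[0] == "copy":
                _, offset, length = op
                out += base[offset:offset + length]
            else:
                out += op[1]
        return bytes(out)


    def resolve(store, oid):
        # Reconstruct the full content of object `oid`.
        kind, *rest = store[oid]
        if kind == "base":
            return rest[0]
        base_id, ops = rest
        base = resolve(store, base_id)   # must finish before the delta applies
        return apply_delta(base, ops)


    def resolve_chains_in_parallel(store, tips):
        # Independent chains can be walked concurrently -- but note that
        # git's real in-core object structures have no locking today, which
        # is exactly the problem discussed above.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda oid: resolve(store, oid), tips))


    if __name__ == "__main__":
        store = {
            "a": ("base", b"hello world\n"),
            "b": ("delta", "a", [("copy", 0, 6), ("insert", b"git\n")]),
            "c": ("base", b"completely unrelated\n"),
        }
        print(resolve_chains_in_parallel(store, ["b", "c"]))
        # [b'hello git\n', b'completely unrelated\n']

In real packfiles, a REF_DELTA entry likewise records the 20-byte SHA1 of its
base right in the entry header, outside the zlib stream - the "pointer stored
uncompressed" case mentioned above.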
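
[Editorial sketch] And to make the "binary data with a magic header"
description concrete: a loose object on disk is simply zlib-deflated
"<type> <size>\0<payload>", so the object layer can hand back any object
without caring what kind it is. A minimal reader sketch follows (again
illustrative, not git's actual reader; the example path is made up).

    # Minimal sketch of reading a loose object (.git/objects/xx/yyyy...).
    # The on-disk format is zlib-deflated "<type> <size>\0<payload>";
    # everything type-specific is layered on top of that.
    import zlib


    def read_loose_object(path):
        with open(path, "rb") as f:
            data = zlib.decompress(f.read())
        header, _, payload = data.partition(b"\0")   # "<type> <size>", then NUL
        obj_type, size = header.split(b" ")
        assert int(size) == len(payload)
        return obj_type.decode("ascii"), payload


    # e.g. read_loose_object(".git/objects/4b/825d...") might return
    # ("tree", b"...") or ("commit", b"tree <sha>\nparent <sha>\n...")

The commit->tree and commit->parent pointers discussed above live as text
inside that compressed payload - exactly the "SHA1 pointers inside the
compressed data" case.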