On Thu, Jan 10, 2008 at 09:30:59PM +0000, Nicolas Pitre wrote:
> On Thu, 10 Jan 2008, Linus Torvalds wrote:
>
> > On Thu, 10 Jan 2008, Nicolas Pitre wrote:
> > >
> > > Here's my rather surprising results:
> > >
> > > My kernel repo pack size without the patch: 184275401 bytes
> > > Same repo with the above patch applied:     205204930 bytes
> > >
> > > So it is only 11% larger.  I was expecting much more.
> >
> > It's probably worth doing those statistics on some other projects.
> >
> > Maybe the difference to other repositories isn't huge, and maybe the
> > kernel *is* a good test-case, but I just wouldn't take that for granted.
>
> Obviously.
>
> This was a really crud test, and my initial goal was to quickly dismiss
> Pierre's assertion.  Turns out that he wasn't that wrong after all,

  Well, that wasn't a random assertion. I made it because I assumed that
a delta is usually less than a few hundred bytes, and as compression is
applied only to the delta, without context, you end up compressing
roughly 500 bytes at a time, which will seldom give excellent
compression ratios.

> and
> if a significant increase in access speed by avoiding zlib for 82% of
> object accesses can also be demonstrated for the kernel, then we have an
> opportunity for some optimization tradeoff with no backward
> compatibility concerns.

  Well, one could use the fact that deltas are not compressed to avoid
copying them around, and that will _necessarily_ be a gain (you can read
them where they have been mmapped, for instance). The numbers that were
given for git annotate use a compression level of `0', which doesn't
exploit that fact, and I wouldn't be surprised to see a noticeable gain
if one does.

  And actually, maybe it's not the deltas we should leave uncompressed,
but objects under a certain size (say 512 bytes?), whatever type they
have, and have the code exploit that fact for real and avoid copies.
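  A minimal sketch of that size-threshold idea, in Python rather than
git's C, and only to illustrate the principle: the one-byte `RAW` /
`DEFLATED` tags and the `encode` / `decode` helpers are hypothetical and
are not git's pack encoding. Entries under the threshold are stored
verbatim, so a reader can hand back a view into the (possibly mmapped)
buffer without inflating or copying anything.

```python
import zlib

THRESHOLD = 512  # bytes; the cutoff floated in this mail, not a git constant

RAW, DEFLATED = 0, 1  # hypothetical one-byte tags, not git's pack format


def encode(payload: bytes, threshold: int = THRESHOLD) -> bytes:
    """Store payloads under the threshold verbatim, deflate the rest."""
    if len(payload) < threshold:
        return bytes([RAW]) + payload
    return bytes([DEFLATED]) + zlib.compress(payload)


def decode(entry):
    """Return the payload; raw entries come back as a zero-copy view."""
    view = memoryview(entry)
    if view[0] == RAW:
        return view[1:]  # slice of the original buffer, nothing is copied
    return zlib.decompress(view[1:])  # inflating allocates a new bytes object


# Illustrative payloads only, not real git objects.
small = b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n" * 4
large = b"int main(void) { return 0; }\n" * 200

small_entry = encode(small)
large_entry = encode(large)
```

  The cost is one tag byte per small entry plus whatever zlib would have
saved on it; whether the saved inflate and copy outweigh that is exactly
what would have to be measured.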
  With this criterion, I expect the repository not to grow much larger
(I'd say quite a bit less than the 10% you had: even in the kernel there
_are_ some larger deltas, and we definitely lose space on them when they
are left uncompressed; I'd expect less than a 5% size variation), and I
_think_ it's worth investigating. At least I expect commands that go
through a lot of small objects (like blame or even log[0]) to see a
10 to 20% speed increase (backed up by some experience I have in avoiding
copies in not-so-similar cases though, so it may be less, and I'll stand
corrected -- and disappointed, a bit).

  [0] If I'm correct, commit messages are "objects" on their own, and I
      don't expect them to be over 512 octets very often.

-- 
·O·  Pierre Habouzit
··O  madcoder@xxxxxxxxxx
OOO  http://www.madism.org