Re: Decompression speed: zip vs lzo

On Jan 9, 2008 10:55 PM, Marco Costalba <mcostalba@xxxxxxxxx> wrote:
> On Jan 10, 2008 4:41 AM, Nicolas Pitre <nico@xxxxxxx> wrote:
> > On Wed, 9 Jan 2008, Johannes Schindelin wrote:
> >
> > > I agree that gzip is already fast enough.
> > >
> > > However, pack v4 had more goodies than just being faster; it also promised
> > > to have smaller packs.
> >
> > Right, like not having to compress tree objects and half of commit
> > objects at all.
>
> Decompression speed has been shown to be a bottleneck in some tests
> involving mainly 'git log'.

Thanks for looking into this, both in this email and in your follow-ups.

I agree that zip time is an issue.  I was looking into reducing the _number_
of zip calls on the same data, but work and personal crises have reduced me
from an infrequent contributor to an occasional gadfly for the moment.

> Regarding backward compatibility, I really don't know at what level git
> functions actually need to know the compression format. Looking at the
> code I would say at a very low level: the functions that deal directly
> with inflate() and friends are few [1] and not directly connected to the
> UI, nor to the git config. Is the compression format something the user
> should know or care about? And if yes, why?
>
> In my tests the assumption of a tarball of source files is unrealistic.
> To test the final size difference I would like to try different
> compressions on a big file that is already packed but not yet zipped.
> Could someone be so kind as to hint how to create such a pack with good
> quality, i.e. with packing levels similar to what is done for public
> repos?
>
> This does not realistically test speed because, as Junio pointed out,
> the real decompression scheme is different: many calls on small
> objects, not one call on a big one. But if the final size is acceptable
> we can move on to more difficult tests.

The approach you're taking (here and in the following emails) of making
the zip/lzo selection switchable and measuring the results should be
enlightening.  For the vast majority of git users, Junio's scenario is
the most relevant.
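
To make the question of who actually needs to know the compression
format concrete, here is a rough standalone sketch of the kind of thin
dispatch table that would keep the few inflate()-calling sites
format-agnostic.  This is not git code, all names in it are mine, and
in git the choice would presumably come from config or an object/pack
header rather than argv:

/*
 * Toy codec table: only this file knows whether a buffer is zlib- or
 * LZO-compressed.  Build (assuming zlib and liblzo2 are installed):
 *   cc -o codec codec.c -lz -llzo2
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>
#include <lzo/lzo1x.h>

struct codec {
	const char *name;
	/* both return 0 on success and update *dst_len to bytes written */
	int (*pack)(unsigned char *src, size_t n,
		    unsigned char *dst, size_t *dst_len);
	int (*unpack)(unsigned char *src, size_t n,
		      unsigned char *dst, size_t *dst_len);
};

static int zlib_pack(unsigned char *src, size_t n,
		     unsigned char *dst, size_t *dst_len)
{
	uLongf out = *dst_len;
	if (compress2(dst, &out, src, n, Z_DEFAULT_COMPRESSION) != Z_OK)
		return -1;
	*dst_len = out;
	return 0;
}

static int zlib_unpack(unsigned char *src, size_t n,
		       unsigned char *dst, size_t *dst_len)
{
	uLongf out = *dst_len;
	if (uncompress(dst, &out, src, n) != Z_OK)
		return -1;
	*dst_len = out;
	return 0;
}

static int lzo_pack(unsigned char *src, size_t n,
		    unsigned char *dst, size_t *dst_len)
{
	static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
	lzo_uint out = *dst_len;
	if (lzo1x_1_compress(src, n, dst, &out, wrkmem) != LZO_E_OK)
		return -1;
	*dst_len = out;
	return 0;
}

static int lzo_unpack(unsigned char *src, size_t n,
		      unsigned char *dst, size_t *dst_len)
{
	lzo_uint out = *dst_len;
	if (lzo1x_decompress_safe(src, n, dst, &out, NULL) != LZO_E_OK)
		return -1;
	*dst_len = out;
	return 0;
}

static const struct codec codecs[] = {
	{ "zlib", zlib_pack, zlib_unpack },
	{ "lzo",  lzo_pack,  lzo_unpack },
};

int main(int argc, char **argv)
{
	/* pretend the choice came from config or a pack header, not argv */
	const struct codec *c = &codecs[argc > 1 && !strcmp(argv[1], "lzo")];
	unsigned char src[4096], packed[8192], back[4096];
	size_t plen = sizeof(packed), blen = sizeof(back);

	if (lzo_init() != LZO_E_OK)
		return 1;
	memset(src, 'x', sizeof(src));	/* trivially compressible test input */
	if (c->pack(src, sizeof(src), packed, &plen) ||
	    c->unpack(packed, plen, back, &blen))
		return 1;
	printf("%s: %zu -> %zu -> %zu bytes\n", c->name, sizeof(src), plen, blen);
	return 0;
}

The interesting part is how little surface area the codec choice needs
to touch; callers only ever see pack()/unpack().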

Of additional interest to me is handling enormous objects more quickly.
I would like to replace some p4 usage here with git, but most users will
only notice the speed difference and not use git's extra features.  They
will therefore compare git add/git commit/git push unfavorably to
p4 edit/p4 submit, because the former effectively does zip/unzip/zip/send
while the latter only does zip/send (git's extra "unzip/zip" comes from
loose objects not being directly copyable into packs).  This speed
difference is irrelevant for small to normal files, but a killer when
committing a collection of, say, 100MB files.
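
(Rough arithmetic, treating an inflate pass as costing about as much as
a deflate pass: git pays roughly three codec passes per big file, namely
zip the loose object, unzip it at pack or push time, and zip it again
into the pack, where p4 pays one.  If the two extra passes ran on a
codec around four times cheaper, the total is 1 + 2/4 = 1.5 passes.)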

Your lzo option could reduce this performance penalty vs p4 from about
3x to close to 1.5x.  If you get it accepted, I'd love to then "fix" the
loose-object copying "problem" and make git _faster_ than p4 on large
files!  Two simple forms of this "fix" would be to use the
once-and-future "new" loose object format (an idea already rejected), or
to encode every loose object as a singleton pack under .git/objects/xx,
so that all (re)packing, in the absence of new deltification, becomes
pack-to-pack copying.  The latter idea is a modification of one from
Nicolas Pitre, and it certainly adds less code than other approaches to
such a "fix".
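
To show what I mean by singleton packs, here is a toy sketch that stores
one blob in a minimal pack-v2-shaped container.  It is standalone, the
names are mine, and it deliberately omits the trailing 20-byte pack
checksum and the .idx file a real pack needs; the point is only that the
deflated payload it writes could later be copied byte-for-byte into a
larger pack instead of being inflated from a loose object and deflated
again:

/*
 * Sketch only, not git source.  Build with: cc -o onepack onepack.c -lz
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <zlib.h>

#define OBJ_BLOB 3	/* object type code used in the pack format */

/* variable-length pack entry header: type plus uncompressed size */
static int encode_entry_header(unsigned char *hdr, unsigned type, uintmax_t size)
{
	int n = 0;
	unsigned char c = (type << 4) | (size & 0x0f);

	size >>= 4;
	while (size) {
		hdr[n++] = c | 0x80;	/* more size bits follow */
		c = size & 0x7f;
		size >>= 7;
	}
	hdr[n++] = c;
	return n;
}

int main(void)
{
	static const char payload[] = "hello, large file goes here\n";
	unsigned char hdr[16], zbuf[256];
	uLongf zlen = sizeof(zbuf);
	uint32_t be;
	FILE *f = fopen("singleton.pack", "wb");

	if (!f || compress2(zbuf, &zlen, (const Bytef *)payload,
			    sizeof(payload) - 1, Z_DEFAULT_COMPRESSION) != Z_OK)
		return 1;

	fwrite("PACK", 1, 4, f);		/* signature */
	be = htonl(2);   fwrite(&be, 4, 1, f);	/* pack version */
	be = htonl(1);   fwrite(&be, 4, 1, f);	/* exactly one object */
	fwrite(hdr, 1, encode_entry_header(hdr, OBJ_BLOB, sizeof(payload) - 1), f);
	fwrite(zbuf, 1, zlen, f);		/* deflated payload, reusable as-is */
	fclose(f);
	printf("wrote singleton.pack, %lu deflated bytes\n", (unsigned long)zlen);
	return 0;
}

The entry header above follows the usual pack object header layout
(type in bits 4-6 of the first byte, remaining size bits in base-128
continuation bytes), which is what makes the verbatim copy possible.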

Thanks,
-- 
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
