Hey, thanks for the detailed analysis. Comments inline.

On Tue, 2018-11-20 at 00:47 +0100, sixy@xxxxxxx wrote:
> Based on work I've done in this area, I'm somewhat sceptical that this
> will work out well. I spent a decent amount of time comparing various
> approaches to data saving for both compressed data and regular binaries.
> This was done a few months ago, so a few of the algos may have changed a
> little since then, but I doubt the changes will be huge.

Do you have some examples of the data you tested?

> The various things I tested were, in no particular order:
> * xdelta3 (and a few similar VCDIFF-based algos)
> * zstd
> * zchunk
> * bsdiff
> * courgette
> * zucchini
> And a few others which aren't interesting enough to mention here.
>
> xdelta3 is old now, but was the standard binary delta compression tool
> for a long time. Of the better-performing algos it has reasonably good
> generation time, and works reasonably well on compressed data (still a
> good way behind the top of the pile), but falls far behind on binaries.
>
> zstd was generally disappointing for both compressed and binary data.
> It's really a compression format, not a delta generator, and while it
> performed very well compared to traditional compression formats, it was
> unable to compete with other tools here.
>
> zchunk had basically the same problems as zstd, except that the chunking
> overhead made files even larger, and small symbol changes could mean that
> a very large number of chunks needed to be transmitted.
>
> bsdiff is a large improvement over xdelta3 in terms of final size for
> binary data. It is considerably more expensive in terms of memory and CPU
> to compute, but is fast to apply[0]. It works much less well on compressed
> data, and really needs to work on a binary to be at its best.
>
> courgette is excellent at reducing binary size, and still very good at
> compressed data. By data size alone it's the best of all of these, but is
> somewhat complex and expensive, particularly in terms of memory. An
> overview of how it works is at [1].
>
> zucchini is an experimental delta generation tool for chromium. I can't
> see much that's been published externally about it, but I found it to
> compress almost as well as courgette but with greatly reduced memory use.
> Code is very much a moving target, but is located at [2].

These seem to be a bit apples-to-oranges. xdelta3, bsdiff, courgette and
zucchini are designed to generate deltas between two specific files. zstd
is just a compression format, while zchunk is designed to download the
smallest difference between two arbitrary files.

You are absolutely right about the cost of small symbol changes; that's one
of the reasons that courgette does some amount of disassembly and that
bsdiff uses what it calls an "add block". Because of zchunk's design,
there's no way it can benefit unless we implement some kind of disassembly,
and I'm leery of going down that road.

> Overall in my tests zstd/zchunk gave massively worse compression compared
> to modern delta algorithms (courgette, zucchini), and the total time to
> download, apply, install was always slowest for zstd-based approaches.

I'd love to see your test methodology. I'm not surprised about the
difference in delta size (which really isn't the same as compression), but
I am a bit surprised by the speed differences you saw, assuming your data
was compressed. If your data wasn't, do remember that rpms are compressed,
so if you're creating a new delta method, it will have to decompress both
the old and new rpms before generating the delta (as deltarpm does). zchunk
is able to get around this extra step because it is the compression format.

> So, then, the only time this would work is if the changed-chunk detection
> feature of zchunk actually works effectively for RPMs. Unfortunately,
> when binaries change (and when RPMs as a whole change) it's not unusual
> for this to manifest as adding/removing significant chunks of data from
> near the beginning of the file, which means that the chunk matching fails
> and you end up with a huge and slow download.

That's... not how the chunk matching works. The chunking function uses the
buzhash algorithm to try to break on consistent patterns, no matter where
data is added or removed. If you remove x bytes from the beginning of the
file and then add y bytes, at most one or two chunks should be affected.
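To make the resynchronization point concrete, here's a rough sketch of
content-defined chunking. This is not zchunk's actual code (zchunk uses
buzhash with its own window and chunk-size parameters); the toy rolling
hash and the MIN_CHUNK/MASK values below are made up for illustration, so
treat it only as a demonstration of why the boundaries line back up after
an edit near the start of a file.

#!/usr/bin/env python3
# Toy content-defined chunker; not zchunk's implementation.
import hashlib
import os

MIN_CHUNK = 2048       # never cut a chunk shorter than this (illustrative)
MASK = (1 << 13) - 1   # cut where the low 13 hash bits are zero (~8 KiB average)

def chunk_boundaries(data):
    """Yield (start, end) offsets of content-defined chunks."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        # Shift-and-XOR rolling hash; the 32-bit mask means the hash only
        # "remembers" roughly the last 32 bytes, so cut decisions are local.
        h = ((h << 1) ^ byte) & 0xFFFFFFFF
        if i - start >= MIN_CHUNK and (h & MASK) == 0:
            yield (start, i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        yield (start, len(data))

def chunk_digests(data):
    return [hashlib.sha256(data[s:e]).hexdigest() for s, e in chunk_boundaries(data)]

old = os.urandom(1 << 20)               # 1 MiB stand-in for the old file
new = os.urandom(4096) + old[8192:]     # drop 8 KiB from the front, add 4 KiB

old_digests = set(chunk_digests(old))
new_digests = chunk_digests(new)
missing = [d for d in new_digests if d not in old_digests]
print("chunks in new file:", len(new_digests))
print("chunks not reusable from old file:", len(missing))

Because each cut decision depends only on a small window of recent bytes,
the boundaries in the unchanged tail of the file land in the same places as
before, so only the chunk or two around the edit gets a new checksum and
needs to be downloaded; everything else can be reused from the copy you
already have.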
> If you care only about making the 50th %ile case better, then zchunk is
> probably at least not any worse than what we have now. However this
> will, I believe, come at the cost of making the 95th %ile case pretty
> unpleasant for end users.
>
> I think it'd be far better to explore Howard's proposal, for per-file
> delta generation (as opposed to per-RPM), and use modern delta
> generation (probably courgette) to compute the delta.

Maybe I misunderstood Howard's proposal, but I understood him to be
suggesting that rebuilding a full rpm from delta data is wasteful when we
could just install the delta data like a regular rpm. It's an interesting
idea that I've commented on elsewhere in this thread.

The problem with generating per-file deltas is the same as with per-rpm
deltas: infrastructure has to be used to generate the deltas, and the whole
point of this proposal is to eliminate that step. Having said that, it may
be that we can generate deltas more efficiently, which would make the cost
of generating them easier to bear. I would be interested in seeing whether
courgette/zucchini-based deltarpms would be significantly faster to
generate than the current bsdiff-based deltarpms.

Thanks again for your response,
Jonathan