Re: A proof-of-concept for delta'ing repodata

Jonathan Dieter <jdieter@xxxxxxxxx> · Tue, 13 Feb 2018 14:03:37 +0200

On Tue, 2018-02-13 at 10:52 +0100, Igor Gnatenko wrote:
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> >  * Many changes to the metadata can mean a large number of ranges
> >    requested.  I ran a check on our mirrors, and three (out of around
> >    150 that had the file I was testing) don't honor range requests at
> >    all, and three others only honor a small number in a single request.
> >    A further seven didn't respond at all (not sure if that had
> >    anything to do with the range requests), and the rest supported
> >    between 256 and 512 ranges in a single request.  We can reduce the
> >    number of ranges requested by always ordering our packages by date. 
> >    This would ensure that new packages are grouped at the end of the
> >    xml where they will be grabbed in one contiguous range.
> 
> This would "break" DNF, because libsolv is assigning Id's by the order of
> packages in metadata. So if something requires "webserver" and there is "nginx"
> and "httpd" providing it (without versions), then lowest Id is picked up (not
> going into details of this). Which means depending on when last update for one
> or other was submitted, users will get different results. This is unacceptable
> from my POV.

That's fair enough, though how hard would it be to change libsolv to
assign Ids based on alphabetical order as opposed to metadata order
(or possibly to reorder the XML before sending it to libsolv)?
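
A minimal sketch of the "reorder before libsolv" option, assuming a
simplified metadata layout with no XML namespaces (real primary.xml uses
namespaces and many more fields). The sample data and the function name
sort_packages are made up for illustration; nothing here touches libsolv.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified metadata in "date" order.
SAMPLE = """<metadata>
  <package><name>nginx</name><version ver="1.12.1"/></package>
  <package><name>httpd</name><version ver="2.4.29"/></package>
  <package><name>bash</name><version ver="4.4.19"/></package>
</metadata>"""


def sort_packages(xml_text):
    """Return the parsed root with its children sorted by <name>."""
    root = ET.fromstring(xml_text)
    packages = list(root)
    # sort() is stable, so packages with equal names keep their order.
    packages.sort(key=lambda pkg: pkg.findtext("name", default=""))
    root[:] = packages
    return root


names = [pkg.findtext("name") for pkg in sort_packages(SAMPLE)]
print(names)
```

With a deterministic sort like this, the order on the wire could change
without changing the order the solver sees.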

To be clear, this optimization would reduce the number of range
requests we have to send to the server, but would not hugely change the
amount we download, so I don't think it's very high priority.
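
To illustrate why grouping new packages together cuts the number of
ranges: adjacent byte ranges can be merged into one before building the
Range header. This is only a sketch; the offsets are invented and the
helper names (coalesce, range_header) are hypothetical.

```python
def coalesce(ranges):
    """Merge overlapping or adjacent (start, end) byte ranges (end inclusive)."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Touches or overlaps the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def range_header(ranges):
    """Build the value of an HTTP Range header from byte ranges."""
    return "bytes=" + ",".join(f"{s}-{e}" for s, e in coalesce(ranges))


# Changed chunks scattered through the file: three ranges.
scattered = [(0, 99), (500, 599), (1200, 1299)]
# The same number of changed chunks grouped at the end: one range.
grouped = [(1000, 1099), (1100, 1199), (1200, 1299)]

print(range_header(scattered))  # bytes=0-99,500-599,1200-1299
print(range_header(grouped))    # bytes=1000-1299
```

The number of bytes requested is the same in both cases; only the number
of ranges differs.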

> >  * Zchunk files use zlib (it gives better compression than xz with such
> >    small chunks), but, because they use a custom zdict, they are not gz
> >    files.  This means that we'll need new tools to read and write them.
> >    (And I am volunteering to do the work here)
> 
> What about zstd? The latest version of lz4 also supports
> dictionaries.

I'll take a look at both of those.

> As someone who has tried to work on this problem, I very much appreciate what
> you have done here. We started with zsync and the results were quite good, but
> zsync is dead and has a ton of bugs. It also requires archives to be
> `--rsyncable`. So my question is: why not add an idx file alongside the
> existing files instead of inventing a new format? The problem is that we will
> have to distribute the old format too (for compatibility reasons).

I'm not sure if it was clear, but I'm basically making --rsyncable
archives with more intelligent divisions between the independent
blocks, which is why it gives better delta performance... you're not
getting *any* redundant data.

I did originally experiment with xz files (a series of concatenated xz
files is still a valid xz file), but the files were 20% larger than
zlib with custom zdict.

The zdict helps us reduce file size by allowing all the chunks to use
the same common strings that will not change (mainly tag names), but
custom zdicts aren't allowed by gzip.
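
A small sketch of the zdict idea using Python's zlib module. The
dictionary contents and the sample chunk are made up for illustration
(the real dictionary would be built from the actual metadata), and the
helper names are hypothetical.

```python
import zlib

# Hypothetical dictionary: strings every chunk is expected to contain.
ZDICT = (b'<package type="rpm"><name></name><arch></arch>'
         b'<version epoch="" ver="" rel=""/><summary></summary></package>')


def compress_chunk(data, zdict=None):
    """Compress one chunk independently, optionally with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_chunk(blob, zdict):
    """Decompress one chunk; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


chunk = (b'<package type="rpm"><name>nginx</name><arch>x86_64</arch>'
         b'<version epoch="1" ver="1.12.1" rel="1.fc27"/>'
         b'<summary>web server</summary></package>')

plain = compress_chunk(chunk)
with_dict = compress_chunk(chunk, ZDICT)
roundtrip = decompress_chunk(with_dict, ZDICT)

print(len(chunk), len(plain), len(with_dict))
```

Each chunk is still decompressible on its own, as long as the reader has
the dictionary.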

I've also toyed with the idea of supporting embedded idx's in zchunk
files so we don't have to keep two files for every local zchunk file. 
We'd still want separate idx files on the webserver, though; otherwise
we're looking at an extra HTTP request to get the size of the index in
the zchunk.  If we embed the index in the file, we must create a new
format as we don't want the index concatenated with the rest of the
uncompressed file when decompressing.
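
To make the embedded-index idea concrete, here is a sketch of one
possible layout. This is NOT the zchunk format; the layout (a 4-byte
index length, then one SHA-256 digest plus 4-byte length per chunk, then
the chunks) and all function names are assumptions for illustration only.

```python
import hashlib
import struct

ENTRY_SIZE = 32 + 4  # SHA-256 digest + 4-byte big-endian chunk length


def pack(chunks):
    """Hypothetical layout: [index length][index entries][chunk data]."""
    index = b"".join(
        hashlib.sha256(c).digest() + struct.pack(">I", len(c)) for c in chunks
    )
    return struct.pack(">I", len(index)) + index + b"".join(chunks)


def read_index(blob):
    """Return ([(digest, length), ...], offset of the first chunk)."""
    (index_len,) = struct.unpack_from(">I", blob, 0)
    entries = []
    for off in range(4, 4 + index_len, ENTRY_SIZE):
        digest = blob[off:off + 32]
        (length,) = struct.unpack_from(">I", blob, off + 32)
        entries.append((digest, length))
    return entries, 4 + index_len


def read_chunks(blob):
    """Return only the chunk data, skipping (and verifying against) the index."""
    entries, pos = read_index(blob)
    chunks = []
    for digest, length in entries:
        chunk = blob[pos:pos + length]
        if hashlib.sha256(chunk).digest() != digest:
            raise ValueError("chunk does not match its index entry")
        chunks.append(chunk)
        pos += length
    return chunks


blob = pack([b"first chunk", b"second chunk"])
print(read_chunks(blob))
```

A remote reader of a file like this would first need the 4-byte length
before it knows how much index to fetch, which is the extra request
mentioned above.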

> I'm not sure if trying to do optimizations by XML tags is very good idea
> especially because I hope that in future we would stop distributing XML's and
> start distributing solv/solvx.

zchunk.py shouldn't care what type of data it's chunking, but it needs
to be able to chunk the same way every time.  Currently it only knows
how to do that with XML, because we can split it based on tag
boundaries, and grouping based on source rpm gives us even better
compression without sacrificing any flexibility.
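
A rough sketch of chunking on tag boundaries and working out which
chunks a client is missing. It is not taken from zchunk.py: the regex,
the sample XML, and the function names are assumptions, it ignores
anything outside <package> elements, and it leaves out the grouping by
source rpm.

```python
import hashlib
import re

# One chunk per <package>...</package> element (simplified, no namespaces).
PACKAGE_RE = re.compile(rb"<package\b.*?</package>", re.DOTALL)


def chunk_xml(data):
    """Split metadata into chunks the same way every time."""
    return PACKAGE_RE.findall(data)


def build_index(chunks):
    """One checksum per chunk, in file order."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]


def chunks_to_fetch(old_index, new_index):
    """Positions in the new file whose chunks are not already held locally."""
    have = set(old_index)
    return [i for i, digest in enumerate(new_index) if digest not in have]


old = (b"<metadata>"
       b"<package><name>bash</name><ver>4.4.12</ver></package>"
       b"<package><name>httpd</name><ver>2.4.29</ver></package>"
       b"<package><name>nginx</name><ver>1.12.1</ver></package>"
       b"</metadata>")
new = (b"<metadata>"
       b"<package><name>bash</name><ver>4.4.12</ver></package>"
       b"<package><name>httpd</name><ver>2.4.30</ver></package>"
       b"<package><name>nginx</name><ver>1.12.1</ver></package>"
       b"<package><name>zsh</name><ver>5.4.1</ver></package>"
       b"</metadata>")

missing = chunks_to_fetch(build_index(chunk_xml(old)),
                          build_index(chunk_xml(new)))
print(missing)  # [1, 3]
```

Only the updated httpd chunk and the new zsh chunk need downloading; the
unchanged chunks are reused from the local copy.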

dl_zchunk.py and unzchunk.py neither know nor care what type of file
they're working with.

Thanks so much for the feedback, and especially for the pointers to lz4
and zstd.  Hopefully they'll get us closer to matching our current gz
size.

Jonathan