Re: Proposed zchunk file format - V3

CC'ing fedora-infrastructure, as I think they got lost somewhere along
the way.

On Tue, 2018-03-20 at 17:04 +0100, Michal Domonkos wrote:
<snip>
> Yeah, the level doesn't really matter much.  My point was that, as long
> as we chunk, some of the data we download will be data we already have
> locally.  Typically (according to mdhist), package updates seem to be
> more common than new additions, so we won't be reusing the unchanged
> parts of package tags.  But that's inevitable if we're chunking.

Ok, I see your point, and you're absolutely right.

<snip>
> > The beauty of the zchunk format (or zsync, or any other chunked format)
> > is that we don't have to download different files based on what we
> > have, but rather, we download either fewer or more parts of the same
> > file based on what we have.  From the server side, we don't have to
> > worry about the deltas, and the clients just get what they need.
> 
> +1
> 
> Simplicity is key, I think.  Even at the cost of not having the
> perfectly efficient solution.  The whole packaging stack is already
> complicated enough.

+1000 on that last!

<snip>
> While I'm not completely sure that application-specific boundaries are
> superior to buzhash (used by casync) in terms of data savings, it's
> clear that using HTTP range requests and concatenating the objects
> together in a smart way (as you suggested previously) to reduce the
> number of HTTP requests is a step in the right direction.
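
One simple way to do the concatenation (just a sketch, not the actual
zchunk code -- the chunk struct below is made up for illustration) is
to merge runs of needed chunks that sit next to each other in the file
into single byte ranges before building the request:

    #include <stdio.h>
    #include <stddef.h>

    struct chunk {
        size_t offset;   /* byte offset of the compressed chunk */
        size_t length;   /* compressed length                   */
        int    needed;   /* 1 if we don't already have it       */
    };

    /* Print one "start-end" byte range per run of consecutive needed
     * chunks; all of them would go into a single Range header. */
    void print_ranges(const struct chunk *c, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (!c[i].needed)
                continue;
            size_t start = c[i].offset;
            size_t end = c[i].offset + c[i].length - 1;

            /* extend while the next chunk is needed and contiguous */
            while (i + 1 < n && c[i + 1].needed &&
                   c[i + 1].offset == end + 1) {
                i++;
                end = c[i].offset + c[i].length - 1;
            }
            printf("%zu-%zu\n", start, end);
        }
    }

    int main(void)
    {
        struct chunk chunks[] = {
            {   0, 100, 0 },    /* already have this one locally */
            { 100, 200, 1 },
            { 300, 150, 1 },    /* contiguous -> same range      */
            { 450,  80, 0 },
            { 530,  60, 1 },
        };

        print_ranges(chunks, sizeof(chunks) / sizeof(chunks[0]));
        return 0;
    }

With that, adjacent missing chunks collapse into a single range, so the
request count stays small even when a lot of the metadata has changed.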

Just to be clear, zchunk *could* use buzhash.  There's no rule about
where the boundaries need to be, only that the application creating the
zchunk file is consistent.  I'd actually like to make the command-line
utility use buzhash, but I'm trying to keep the code BSD 2-clause, so I
can't just lift casync's buzhash code, and I haven't had time to write
that part myself.  

Currently zck.c has a really ugly if statement: when its condition is
true, it chooses chunk boundaries by string matching, and when it's
false, it falls back to a really naive, inefficient rolling hash.  If
you wanted to contribute buzhash, I'd happily take it!
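
In case anyone wants to pick that up, here's a rough sketch of the kind
of buzhash-style rolling hash I mean (just a sketch, not zchunk code;
the window size, mask, and table seed below are all arbitrary):

    #include <stdint.h>
    #include <stddef.h>

    #define WINDOW 48         /* rolling window size (arbitrary)       */
    #define MASK   0x0fff     /* ~4 KiB average chunk size (arbitrary) */

    static uint32_t table[256];

    /* Fill the table with fixed pseudo-random constants.  Any table
     * works, as long as every producer of a given file uses the same
     * one. */
    void init_table(void)
    {
        uint32_t x = 0x9e3779b9u;                    /* arbitrary seed */

        for (int i = 0; i < 256; i++) {
            x ^= x << 13; x ^= x >> 17; x ^= x << 5;    /* xorshift32 */
            table[i] = x;
        }
    }

    static uint32_t rotl(uint32_t v, unsigned n)
    {
        return (v << n) | (v >> (32 - n));
    }

    /* Return the offset just past the first boundary in buf, or len
     * if the buffer contains no boundary. */
    size_t find_boundary(const unsigned char *buf, size_t len)
    {
        uint32_t h = 0;

        for (size_t i = 0; i < len; i++) {
            h = rotl(h, 1) ^ table[buf[i]];
            if (i >= WINDOW) {
                /* drop the byte that just left the window */
                h ^= rotl(table[buf[i - WINDOW]], WINDOW % 32);
                if ((h & MASK) == 0)
                    return i + 1;
            }
        }
        return len;
    }

The point of the rolling hash is that an insertion or deletion only
moves the boundaries near it, so the chunks further along stay
byte-identical and can be reused.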

> BTW, in the original thread, you mentioned a reduction of 30-40% when
> using casync.  I'm wondering, how did you measure it?  I saw chunk
> reuse ranging from 80% to 90% per metadata update, which seemed quite
> optimistic.  What I did was:
> 
> $ casync make snap1.caidx /path/to/repodata/snap1
> $ casync make --verbose snap2.caidx /path/to/repodata/snap2
> <snip>
> Reused chunks: X (Y%)
> <snip>

IIRC, I went into the web server logs and measured the number of bytes
that casync actually downloaded as compared to the gzip size of the
data.
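
(So, roughly: savings = 1 - (bytes actually downloaded / gzip size of
the data), which is where the 30-40% figure came from.)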

Thanks so much for your interest!

Jonathan
_______________________________________________
infrastructure mailing list -- infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to infrastructure-leave@xxxxxxxxxxxxxxxxxxxxxxx



