Re: Proposed zchunk file format - V3

Michal Domonkos <mdomonko@xxxxxxxxxx> · Thu, 22 Mar 2018 14:45:28 +0100

On Thu, Mar 22, 2018 at 12:39 PM, Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
> CC'ing fedora-infrastructure, as I think they got lost somewhere along
> the way.

Oh, thanks.  I screwed up (again), this time by hitting "Reply"
instead of "Reply to all" in Gmail (*facepalm*).

> Just to be clear, zchunk *could* use buzhash.  There's no rule about
> where the boundaries need to be, only that the application creating the
> zchunk file is consistent.  I'd actually like to make the command-line
> utility use buzhash, but I'm trying to keep the code BSD 2-clause, so I
> can't just lift casync's buzhash code, and I haven't had time to write
> that part myself.

Makes sense, thanks for the clarification!

For completeness, I'm also copying below the "git concept" idea that I
elaborated on in the "lost" email:

---8<-----

The git concept is basically just a generalization of the chunking
idea we're talking about.

As long as your data semantically represents a tree, you can chunk up
the content and get a structure like this:

tree
    +-- tree
        +-- blob1
        +-- blob2
        +-- blob3
    +-- tree
        +-- blob1
        +-- blob2
        +-- blob3
    +-- ...

Now, to sync this structure locally, a simple recursive algorithm is
used:  look at the root tree to see what objects it needs (i.e. gather
a list of hashes), then download them and do the same with those
recursively until you have no more incomplete trees left.  In order to
avoid having too many files, blobs could be stored in one file (maybe
per tree) and accessed via HTTP ranges, the same way as in zchunk.
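To make the recursion concrete, here is a minimal Python sketch of that
sync loop.  Everything in it is an assumption made for illustration: the
object encoding (a first line of "tree" or "blob"), the use of SHA-256,
and plain dicts standing in for the remote server and the local cache.
It is not the zchunk or git format, and the dict lookup is where a real
client would issue an HTTP (range) request.

```python
import hashlib


def store_object(store: dict, data: bytes) -> str:
    """Store an object under the hash of its content and return that hash."""
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def make_blob(store: dict, payload: bytes) -> str:
    # Hypothetical encoding: a type line followed by the raw payload.
    return store_object(store, b"blob\n" + payload)


def make_tree(store: dict, children: list) -> str:
    # Hypothetical encoding: a type line followed by one child hash per line.
    return store_object(store, b"tree\n" + "\n".join(children).encode())


def sync(root: str, remote: dict, local: dict) -> list:
    """Fetch every object reachable from root that is missing locally.

    Assumes earlier syncs ran to completion, so an object that is already
    present locally implies its whole subtree is present as well.
    """
    fetched = []
    pending = [root]
    while pending:
        digest = pending.pop()
        if digest in local:
            continue  # unchanged subtree, nothing to download
        data = remote[digest]  # stand-in for the actual download
        local[digest] = data
        fetched.append(digest)
        kind, _, body = data.partition(b"\n")
        if kind == b"tree":
            pending.extend(body.decode().split())
    return fetched


# Demo: two subtrees with three blobs each, then one blob changes.
remote, local = {}, {}
left = [make_blob(remote, name) for name in (b"a1", b"a2", b"a3")]
right = [make_blob(remote, name) for name in (b"b1", b"b2", b"b3")]
root_v1 = make_tree(remote, [make_tree(remote, left), make_tree(remote, right)])
first = sync(root_v1, remote, local)

left[1] = make_blob(remote, b"a2-changed")
root_v2 = make_tree(remote, [make_tree(remote, left), make_tree(remote, right)])
second = sync(root_v2, remote, local)
print(len(first), len(second))
```

In this demo the first sync downloads all 9 objects, while the second
one downloads only 3: the new root, the one subtree that changed, and
the changed blob.  The untouched subtree is skipped as soon as its hash
is found locally.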

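The point is, you will only have to fetch those subtrees where some
objects along the path have changed.  The cost of an update then
depends on how deep the tree is and how well you do the chunking of
your content: with a reasonably balanced tree, a single changed blob
costs a number of fetches that is logarithmic in the total number of
blobs.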
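As a rough back-of-the-envelope check of the logarithmic behaviour, the
sketch below counts how many objects have to be re-fetched when exactly
one blob changes.  It assumes an idealized, perfectly balanced tree with
a fixed fan-out, which real metadata will not have; the function name
and parameters are made up for this example.

```python
def objects_to_fetch(num_blobs: int, fanout: int) -> int:
    """Objects re-downloaded when one blob changes in a balanced tree.

    Counts the changed blob plus every tree on the path up to the root.
    """
    if num_blobs < 1 or fanout < 2:
        raise ValueError("need at least one blob and a fan-out of 2 or more")
    depth = 0      # number of tree levels above the blobs
    capacity = 1   # how many blobs a tree of this depth can hold
    while capacity < num_blobs:
        capacity *= fanout
        depth += 1
    return depth + 1


for blobs, fanout in [(9, 3), (50000, 10), (50000, 100)]:
    print(blobs, fanout, objects_to_fetch(blobs, fanout))
```

Note that this counts objects, not bytes: a larger fan-out means fewer
levels but bigger tree objects, so the best trade-off depends on how
the content actually changes.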

Applying this to our domain, we have:

repomd (tree)
    +-- primary (tree)
        +-- srpm1 (tree)
            +-- rpm1 (blob)
            +-- rpm2 (blob)
        +-- srpm2 (tree)
        +-- srpm3 (tree)
    +-- filelists (tree)
        +-- ...

Doing a "checkout" of such a structure would result in the traditional
metadata files we're using now.  That's just for backward
compatibility; we could, of course, have a different structure that's
better suited for our use case.
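A "checkout" in this sense could be as simple as a depth-first walk
that concatenates the blobs.  The sketch below is only an illustration
under made-up assumptions: objects are keyed by readable names instead
of hashes, and the XML fragments are placeholders, not real repodata.

```python
def checkout(name: str, store: dict) -> bytes:
    """Rebuild a flat file by joining all blobs under `name`, depth first."""
    kind, value = store[name]
    if kind == "blob":
        return value
    # A tree lists its children in order; recurse into each of them.
    return b"".join(checkout(child, store) for child in value)


# Hypothetical object store mirroring the layout above.
store = {
    "primary": ("tree", ["srpm1", "srpm2"]),
    "srpm1": ("tree", ["rpm1", "rpm2"]),
    "srpm2": ("tree", ["rpm3"]),
    "rpm1": ("blob", b"<package>rpm1</package>\n"),
    "rpm2": ("blob", b"<package>rpm2</package>\n"),
    "rpm3": ("blob", b"<package>rpm3</package>\n"),
}

print(checkout("primary", store).decode(), end="")
```

A real primary.xml would also need its header and footer, which could
be stored as ordinary blobs at the start and end of the tree.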

As you can see, this is really just zchunk, only generalized (not sure
if compression plays a role here, I haven't considered it).

---8<-----

Regards,

Michal