Re: Proposed zchunk file format - V3

Neal Gompa <ngompa13@xxxxxxxxx> · Thu, 22 Mar 2018 10:11:17 -0400

On Thu, Mar 22, 2018 at 9:45 AM, Michal Domonkos <mdomonko@xxxxxxxxxx> wrote:
> On Thu, Mar 22, 2018 at 12:39 PM, Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
>> CC'ing fedora-infrastructure, as I think they got lost somewhere along
>> the way.
>
> Oh, thanks.  I screwed up (again), this time by hitting "Reply"
> instead of "Reply to all" in gmail (*facepalm*).
>
>> Just to be clear, zchunk *could* use buzhash.  There's no rule about
>> where the boundaries need to be, only that the application creating the
>> zchunk file is consistent.  I'd actually like to make the command-line
>> utility use buzhash, but I'm trying to keep the code BSD 2-clause, so I
>> can't just lift casync's buzhash code, and I haven't had time to write
>> that part myself.
>
> Makes sense, thanks for the clarification!
>
> For completeness, I'm also copying below the "git concept" idea that I
> elaborated on in the "lost" email:
>
> ---8<-----
>
> The git concept is basically just a generalization of the chunking
> idea we're talking about.
>
> As long as your data semantically represents a tree, you can chunk up
> the content and get a structure like this:
>
> tree
>     +-- tree
>         +-- blob1
>         +-- blob2
>         +-- blob3
>     +-- tree
>         +-- blob1
>         +-- blob2
>         +-- blob3
>     +-- ...
>
> Now, to sync this structure locally, a simple recursive algorithm is
> used:  look at the root tree to see what objects it needs (i.e. gather
> a list of hashes), then download them and do the same with those
> recursively until you have no more incomplete trees left.  In order to
> avoid having too many files, blobs could be stored in one file (maybe
> per tree) and accessed via HTTP ranges, the same way as in zchunk.
>
> The point is, you will only have to fetch those subtrees where some
> objects along the path have changed.  The effectiveness is then a
> (logarithmic) function of how deep and how well you do the chunking of
> your content.
>
> Applying this to our domain, we have:
>
> repomd (tree)
>     +-- primary (tree)
>         +-- srpm1 (tree)
>             +-- rpm1 (blob)
>             +-- rpm2 (blob)
>         +-- srpm2 (tree)
>         +-- srpm3 (tree)
>     +-- filelists (tree)
>         +-- ...
>
> Doing a "checkout" of such a structure would result in the traditional
> metadata files we're using now.  That's just for backward
> compatibility; we could, of course, have a different structure that's
> better suited for our use case.
>
> As you can see, this is really just zchunk, only generalized (not sure
> if compression plays a role here, I haven't considered it).
>
> ---8<-----
>
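
For reference, the recursive sync described in the quoted "git concept"
could be sketched roughly as below. This is only an illustration, not
code from zchunk, casync, or git. The in-memory dicts standing in for
the remote and local stores, the JSON object encoding, the use of
SHA-256, and the names put() and sync() are all assumptions made for
the example. It also assumes that any object already present locally
has its whole subtree present too.

```python
import hashlib
import json


def put(store, kind, payload):
    """Store an object under the SHA-256 of its serialized form (hypothetical encoding)."""
    data = json.dumps({"kind": kind, "payload": payload}, sort_keys=True).encode()
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def sync(root, remote, local):
    """Copy every object reachable from root that local lacks; return the fetched hashes."""
    fetched = []
    pending = [root]
    while pending:
        digest = pending.pop()
        if digest in local:
            continue  # assumed complete, so the whole subtree is skipped
        data = remote[digest]  # stands in for an HTTP (range) request
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("hash mismatch for " + digest)
        local[digest] = data
        fetched.append(digest)
        obj = json.loads(data)
        if obj["kind"] == "tree":
            # A tree's payload maps child names to child hashes.
            pending.extend(obj["payload"].values())
    return fetched


# Build a small "repomd" tree shaped like the one in the quoted mail.
remote = {}
rpm1 = put(remote, "blob", "rpm1 v1")
rpm2 = put(remote, "blob", "rpm2 v1")
rpm3 = put(remote, "blob", "rpm3 v1")
files = put(remote, "blob", "filelists v1")
srpm1 = put(remote, "tree", {"rpm1": rpm1, "rpm2": rpm2})
srpm2 = put(remote, "tree", {"rpm3": rpm3})
primary = put(remote, "tree", {"srpm1": srpm1, "srpm2": srpm2})
filelists = put(remote, "tree", {"all": files})
repomd = put(remote, "tree", {"primary": primary, "filelists": filelists})

local = {}
first = sync(repomd, remote, local)
print(len(first))  # 9: four blobs plus five trees

# Change one rpm; only the objects on the path to the root get new hashes.
rpm2_v2 = put(remote, "blob", "rpm2 v2")
srpm1_v2 = put(remote, "tree", {"rpm1": rpm1, "rpm2": rpm2_v2})
primary_v2 = put(remote, "tree", {"srpm1": srpm1_v2, "srpm2": srpm2})
repomd_v2 = put(remote, "tree", {"primary": primary_v2, "filelists": filelists})

second = sync(repomd_v2, remote, local)
print(len(second))  # 4: repomd, primary, srpm1 and the changed rpm2
```

In this toy run the second sync leaves the filelists and srpm2 subtrees
untouched, which is the "only fetch subtrees where something along the
path changed" property described in the quoted text.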

One thing I'm concerned about is handling appended metadata. For
example, both Mageia and openSUSE append AppStream metadata to the
repodata, using a combination of appstream-builder[1] (or
appstream-generator[2]) and modifyrepo_c[3]. How does this scale to
handling that?

[1]: https://www.mankier.com/1/appstream-builder
[2]: https://www.mankier.com/1/appstream-generator
[3]: https://www.mankier.com/8/modifyrepo_c

-- 
真実はいつも一つ！/ Always, there's only one truth!