On Thu, Mar 22, 2018 at 9:45 AM, Michal Domonkos <mdomonko@xxxxxxxxxx> wrote:
> On Thu, Mar 22, 2018 at 12:39 PM, Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
>> CC'ing fedora-infrastructure, as I think they got lost somewhere along
>> the way.
>
> Oh, thanks.  I screwed up (again), this time by hitting "Reply"
> instead of "Reply to all" in gmail (*facepalm*).
>
>> Just to be clear, zchunk *could* use buzhash.  There's no rule about
>> where the boundaries need to be, only that the application creating the
>> zchunk file is consistent.  I'd actually like to make the command-line
>> utility use buzhash, but I'm trying to keep the code BSD 2-clause, so I
>> can't just lift casync's buzhash code, and I haven't had time to write
>> that part myself.
>
> Makes sense, thanks for the clarification!
>
> For completeness, I'm also copying below the "git concept" idea that I
> elaborated on in the "lost" email:
>
> ---8<-----
>
> The git concept is basically just a generalization of the chunking
> idea we're talking about.
>
> As long as your data semantically represents a tree, you can chunk up
> the content and get a structure like this:
>
> tree
> +-- tree
>     +-- blob1
>     +-- blob2
>     +-- blob3
> +-- tree
>     +-- blob1
>     +-- blob2
>     +-- blob3
> +-- ...
>
> Now, to sync this structure locally, a simple recursive algorithm is
> used: look at the root tree to see what objects it needs (i.e. gather
> a list of hashes), then download them and do the same with those
> recursively until you have no more incomplete trees left.  In order to
> avoid having too many files, blobs could be stored in one file (maybe
> per tree) and accessed via HTTP ranges, the same way as in zchunk.
>
> The point is, you will only have to fetch those subtrees where some
> objects along the path have changed.  The effectiveness is then a
> (logarithmic) function of how deep and how well you do the chunking of
> your content.
>
> Applying this to our domain, we have:
>
> repomd (tree)
> +-- primary (tree)
>     +-- srpm1 (tree)
>         +-- rpm1 (blob)
>         +-- rpm2 (blob)
>     +-- srpm2 (tree)
>     +-- srpm3 (tree)
> +-- filelists (tree)
> +-- ...
>
> Doing a "checkout" of such a structure would result in the traditional
> metadata files we're using now.  That's just for backward
> compatibility; we could, of course, have a different structure that's
> better suited for our use case.
>
> As you can see, this is really just zchunk, only generalized (not sure
> if compression plays a role here, I haven't considered it).
>
> ---8<-----
>

One thing I'm concerned about is handling appended metadata. For
example, both Mageia and openSUSE append AppStream metadata to the
repodata, using a combination of appstream-builder[1] (or
appstream-generator[2]) and modifyrepo_c[3].

How does this scale to handling that?

[1]: https://www.mankier.com/1/appstream-builder
[2]: https://www.mankier.com/1/appstream-generator
[3]: https://www.mankier.com/8/modifyrepo_c

-- 
真実はいつも一つ!/ Always, there's only one truth!
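
For reference, the recursive sync described in the quoted "git concept"
could be sketched roughly as follows. This is only an illustration under
stated assumptions, not zchunk's or git's actual format: objects are
addressed by SHA-256, trees are serialized as JSON, the "remote" is a plain
dict standing in for HTTP (range) requests, and every name here
(put_blob, put_tree, sync, build_repo) is hypothetical.

```python
import hashlib
import json


def object_id(data: bytes) -> str:
    """Content address of an object: SHA-256 of its serialized bytes."""
    return hashlib.sha256(data).hexdigest()


def put_blob(store: dict, payload: bytes) -> str:
    """Store a leaf object and return its id."""
    data = b"blob\n" + payload
    store[object_id(data)] = data
    return object_id(data)


def put_tree(store: dict, children: dict) -> str:
    """Store a tree mapping child names to child object ids."""
    data = b"tree\n" + json.dumps(children, sort_keys=True).encode()
    store[object_id(data)] = data
    return object_id(data)


def children_of(data: bytes) -> dict:
    """Return the name -> id mapping of a tree, or {} for a blob."""
    kind, _, body = data.partition(b"\n")
    return json.loads(body) if kind == b"tree" else {}


def sync(oid: str, remote: dict, local: dict) -> list:
    """Fetch every object reachable from oid that is missing locally.

    An object is stored only after all of its children, so finding an id
    in `local` implies its whole subtree is already complete and can be
    skipped. Returns the list of ids that had to be fetched.
    """
    fetched = []
    if oid in local:
        return fetched
    data = remote[oid]  # stand-in for a download
    if object_id(data) != oid:
        raise ValueError("hash mismatch for " + oid)
    for child in children_of(data).values():
        fetched += sync(child, remote, local)
    local[oid] = data
    fetched.append(oid)
    return fetched


def build_repo(store: dict, rpm2_payload: bytes) -> str:
    """Build a toy repomd tree shaped like the one in the quoted email."""
    srpm1 = put_tree(store, {
        "rpm1": put_blob(store, b"rpm1 metadata"),
        "rpm2": put_blob(store, rpm2_payload),
    })
    srpm2 = put_tree(store, {"rpm3": put_blob(store, b"rpm3 metadata")})
    primary = put_tree(store, {"srpm1": srpm1, "srpm2": srpm2})
    filelists = put_tree(store, {"files": put_blob(store, b"file lists")})
    return put_tree(store, {"primary": primary, "filelists": filelists})


remote, local = {}, {}
old_root = build_repo(remote, b"rpm2 v1")
first = sync(old_root, remote, local)   # initial sync fetches everything
new_root = build_repo(remote, b"rpm2 v2")
second = sync(new_root, remote, local)  # only the changed path is fetched
print(len(first), len(second))          # prints: 9 4
```

In this toy run, changing one rpm blob means re-fetching only that blob
plus the trees on the path up to the root (srpm1, primary, repomd); the
filelists and srpm2 subtrees are skipped because their hashes are
unchanged.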