On Thu, Mar 22, 2018 at 12:39 PM, Jonathan Dieter <jdieter@xxxxxxxxx> wrote:
> CC'ing fedora-infrastructure, as I think they got lost somewhere along
> the way.

Oh, thanks.  I screwed up (again), this time by hitting "Reply" instead
of "Reply to all" in gmail (*facepalm*).

> Just to be clear, zchunk *could* use buzhash.  There's no rule about
> where the boundaries need to be, only that the application creating the
> zchunk file is consistent.  I'd actually like to make the command-line
> utility use buzhash, but I'm trying to keep the code BSD 2-clause, so I
> can't just lift casync's buzhash code, and I haven't had time to write
> that part myself.

Makes sense, thanks for the clarification!

For completeness, I'm also copying below the "git concept" idea that I
elaborated on in the "lost" email:

---8<-----
The git concept is basically just a generalization of the chunking idea
we're talking about.  As long as your data semantically represents a
tree, you can chunk up the content and get a structure like this:

tree
+-- tree
    +-- blob1
    +-- blob2
    +-- blob3
+-- tree
    +-- blob1
    +-- blob2
    +-- blob3
+-- ...

Now, to sync this structure locally, a simple recursive algorithm is
used: look at the root tree to see what objects it needs (i.e. gather a
list of hashes), then download them and do the same with those
recursively until you have no more incomplete trees left.

In order to avoid having too many files, blobs could be stored in one
file (maybe per tree) and accessed via HTTP ranges, the same way as in
zchunk.

The point is, you will only have to fetch those subtrees where some
objects along the path have changed.  The effectiveness is then a
(logarithmic) function of how deep and how well you do the chunking of
your content.

Applying this to our domain, we have:

repomd (tree)
+-- primary (tree)
    +-- srpm1 (tree)
        +-- rpm1 (blob)
        +-- rpm2 (blob)
    +-- srpm2 (tree)
    +-- srpm3 (tree)
+-- filelists (tree)
+-- ...
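
To make the recursive sync described above a bit more concrete, here is
a rough Python sketch.  Everything in it is an assumption for
illustration only: the names (put_blob, put_tree, sync, build_repo), the
use of SHA-256, and the in-memory dicts standing in for the remote
server and the local cache.  It is not how zchunk, casync or git
actually store things.

```python
import hashlib
import json


def put_blob(store, data):
    """Store a leaf object (e.g. one package's metadata) under its hash."""
    digest = hashlib.sha256(b"blob:" + data).hexdigest()
    store[digest] = ("blob", data)
    return digest


def put_tree(store, children):
    """Store a tree object: a mapping of child name -> child hash."""
    payload = json.dumps(children, sort_keys=True).encode()
    digest = hashlib.sha256(b"tree:" + payload).hexdigest()
    store[digest] = ("tree", dict(children))
    return digest


def sync(root, remote, local):
    """Copy every object reachable from root that local is missing.

    Assumes that an object already present locally has all of its
    descendants present too, so whole unchanged subtrees are skipped.
    Returns the list of hashes that had to be fetched.
    """
    fetched = []
    pending = [root]
    while pending:
        digest = pending.pop()
        if digest in local:
            continue  # unchanged subtree, nothing to download
        kind, body = remote[digest]  # stands in for an HTTP (range) request
        local[digest] = (kind, body)
        fetched.append(digest)
        if kind == "tree":
            pending.extend(body.values())
    return fetched


def build_repo(store, rpm2_data):
    """Build a toy repomd tree; only rpm2's content varies between calls."""
    srpm1 = put_tree(store, {
        "rpm1": put_blob(store, b"rpm1 v1"),
        "rpm2": put_blob(store, rpm2_data),
    })
    srpm2 = put_tree(store, {"rpm3": put_blob(store, b"rpm3 v1")})
    primary = put_tree(store, {"srpm1": srpm1, "srpm2": srpm2})
    filelists = put_tree(store, {"files": put_blob(store, b"file list")})
    return put_tree(store, {"primary": primary, "filelists": filelists})


remote, local = {}, {}
first = sync(build_repo(remote, b"rpm2 v1"), remote, local)   # initial sync
second = sync(build_repo(remote, b"rpm2 v2"), remote, local)  # rpm2 changed
print(len(first), len(second))
```

In this toy repo the first sync fetches all nine objects, while the
second one fetches only the four objects on the changed path (the new
rpm2 blob plus the srpm1, primary and repomd trees above it).
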

Doing a "checkout" of such a structure would result in the traditional
metadata files we're using now.  That's just for backward compatibility;
we could, of course, have a different structure that's better suited for
our use case.

As you can see, this is really just zchunk, only generalized (not sure
if compression plays a role here; I haven't considered it).
---8<-----

Regards,
Michal