Hi Jonathan,

To me, the zchunk idea looks good.

Incidentally, for the last couple of months I have been trying to rethink the way we cache metadata on the clients, as part of the libdnf (re)design efforts. My goal was to de-duplicate the data between similar repos in the cache as well as to decrease the size that needs to be downloaded every time (which inevitably leads to this topic). I came up with two different strategies:

1) Chunking

At first, I realized that our repodata resembles the git data model (a content-addressable file system). Git has objects, which can be either blobs or trees. A tree is an index of objects referred to by their hashes. In our domain, we have repomd.xml (a tree) that refers to primary.xml and other files (trees), which in turn refer (well, semantically at least) to <package> snippets (blobs) and rpm files. What's different from git is that our trees are xml files and we compress/combine some of them into a single file (such as primary.xml). On the abstract level, though, the concept is the same.

With this, you already get a pretty efficient way to distribute a recursive data structure such as the repodata, provided you can break it down into objects wisely. It might not be super efficient, but it's many times better than what we have now. That made me think that either using git (libgit2) directly or doing a small, lightweight implementation of the core concepts might be the way to go. I even played with the latter a bit (I didn't get to breaking down primary.xml, though):

https://github.com/dmnks/rhs-proto

In the context of this thread, this is basically what you do with zchunk (just much better) :) There's a rough sketch of the idea at the end of this mail.

2) Deltas

Later, during this year's devconf, I had a few "brainstorming" sessions with Florian Festi, who pointed out that the differences between metadata updates might often be on the sub-package level (e.g. just the NEVRA in the version tag), so chunking on package boundaries might not give us the best possible results. Perhaps, instead, we could generate deltas at the binary level.

Git does implement object deltas (see packfiles). However, serving them requires the webserver to be "smart", while all we can afford in the Fedora infrastructure is plain HTTP GET requests, so that's already a no-go. An alternative would be to pre-generate (compressed) binary deltas for the last N versions and let clients download an index file that tells them which deltas they're missing and should download. This is basically what Debian's pdiff format does (there's a sketch of that flow at the end of this mail, too). One downside of this approach is that it doesn't give us de-duplication on clients consuming multiple repos with similar content (probably quite common, with RHEL subscriptions at least).

Then I stumbled upon casync, which combines the benefits of both strategies: it chunks based on the shape of the data (arguably giving better results than chunking on package boundaries), and it doesn't require a smart protocol. However, it involves a lot of HTTP requests, as you already mentioned.

Despite that, I'm still leaning towards chunking as the better of the two solutions. The question is how much granularity we want. You made a good point: the repodata format is fixed (be it xml or solv), so we might as well take advantage of it to detect chunk boundaries, rather than using a rolling hash (though I have no data to back that up). I'm not sure how to approach the many-GET-requests problem (or the lack of range support), though.
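To make 1) a bit more concrete, here is a minimal Python sketch of splitting primary.xml on <package> boundaries into content-addressed blobs plus an index (a "tree"). The directory layout, the function names and the choice of SHA-256 are all made up for illustration; this is not what rhs-proto or zchunk actually do, just the general shape of the idea:

# Minimal sketch of content-addressable chunking on <package> boundaries.
# All names ("objects/", the index layout) are hypothetical.
import hashlib
import os
import re

def split_packages(primary_xml):
    """Return the <package>...</package> snippets found in a primary.xml string."""
    return re.findall(r"<package\b.*?</package>", primary_xml, re.DOTALL)

def store_objects(primary_xml, outdir="objects"):
    """Store each snippet as a blob named by its SHA-256 and return an index (a 'tree')."""
    os.makedirs(outdir, exist_ok=True)
    index = []
    for snippet in split_packages(primary_xml):
        digest = hashlib.sha256(snippet.encode("utf-8")).hexdigest()
        path = os.path.join(outdir, digest)
        if not os.path.exists(path):  # de-duplication happens here
            with open(path, "w", encoding="utf-8") as f:
                f.write(snippet)
        index.append(digest)
    return index

# A client holding an older index only needs to fetch the digests it doesn't
# already have locally; the rest of primary.xml can be reassembled from cache,
# and identical packages shared between similar repos are stored only once.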
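And a rough sketch of the delta flow from 2), assuming a hypothetical plain-text index format with one "old-version new-version delta-file" entry per line. This is not Debian's actual pdiff format, just an illustration of how a client could work out which deltas it is missing from a single GET of the index:

# Hypothetical delta index walk; the index format and file names are invented.
def missing_deltas(index_text, local_version):
    """Return the delta file names needed to get from local_version to the newest one."""
    chain = {}
    latest = None
    for line in index_text.splitlines():
        old, new, delta = line.split()
        chain[old] = (new, delta)
        latest = new
    deltas = []
    version = local_version
    # A local version not present in the index at all means the client is too
    # old and has to fall back to a full download (empty list returned here).
    while version in chain and version != latest:
        version, delta = chain[version]
        deltas.append(delta)
    return deltas

# Example: with an index of
#   20180301 20180302 primary-20180301-20180302.delta
#   20180302 20180303 primary-20180302-20180303.delta
# a client at 20180301 downloads both deltas; one at 20180303 downloads nothing.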
As part of my efforts, I created this "small" git repo that contains metadata snapshots since ~February, which can be useful for seeing what typical metadata updates look like. Feel free to use it (e.g. for testing out zchunk):

https://pagure.io/mdhist

Thanks,
Michal