Re: sorting yum/dnf metadata and metadata diffs

Daniel Mach <dmach@xxxxxxxxxx> · Fri, 13 Feb 2015 09:17:09 +0100

Hi,
there's been some work in progress already:
https://bugzilla.redhat.com/show_bug.cgi?id=850896

Proof-of-concept code (to be merged into dnf/createrepo_c in the future):
https://github.com/Tojaj/DeltaRepo

The idea behind that is simple:
* create deltas as small repos on server
* download deltas on client
* do in-memory "mergerepo" on client
  (or cache it on disk if it makes sense)
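
The three steps above can be sketched roughly as follows. This is only an
illustration, not the actual DeltaRepo format or API: the `Package` and
`DeltaRepo` classes, the `pkgid` field and the `merge_repo` function are all
hypothetical names. It assumes a delta carries the packages that were added or
changed, plus the ids of the packages that were removed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Package:
    # Hypothetical, heavily simplified package record.
    pkgid: str      # assumed unique id (e.g. a checksum)
    name: str
    arch: str
    version: str


@dataclass
class DeltaRepo:
    # Hypothetical delta: a small repo of new packages plus a removal list.
    packages: list = field(default_factory=list)
    removed: set = field(default_factory=set)


def merge_repo(base, delta):
    """Apply a delta to a base package list entirely in memory."""
    merged = {pkg.pkgid: pkg for pkg in base}
    # Drop packages that no longer exist in the new repo state.
    for pkgid in delta.removed:
        merged.pop(pkgid, None)
    # Add the packages shipped in the delta repo.
    for pkg in delta.packages:
        merged[pkg.pkgid] = pkg
    return sorted(merged.values(), key=lambda p: (p.name, p.arch, p.version))
```

Because the merge only works on package records, it does not care whether the
base and the delta were parsed from sqlite or xml, which is the format
independence mentioned below.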

I consider this approach better than making diffs,
mainly because it's simple and clean, and it works with any repo format (sqlite, xml or a mix of both).

- daniel

On 13.2.2015 at 08:11, Casey Jao wrote:
How feasible would it be to keep the listings in primary.xml and
filelists.xml sorted by package name and arch? Doing so could open the
door to simple and efficient diffs of repository metadata.

I recently ran some quick tests using Python and ElementTree. While the
F21 primary.xml files from 2/7 and 2/9 both weigh around 2.6M compressed
and ~18M uncompressed, sorting them and running a simple line-by-line
comparison produced a diff of ~500K, which compressed down to ~70K. A
similar procedure on the 8M filelists.xml yielded a diff which
compressed to ~200K.
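
A minimal sketch of that kind of test, assuming primary.xml uses the
"http://linux.duke.edu/metadata/common" namespace and that every <package>
element has <name> and <arch> children. This is not the original test script;
the function names `sorted_package_lines` and `metadata_diff` are made up for
illustration.

```python
import difflib
import xml.etree.ElementTree as ET

# Assumed default namespace of primary.xml.
NS = {"c": "http://linux.duke.edu/metadata/common"}


def sorted_package_lines(xml_text):
    """Return the lines of all <package> entries, sorted by (name, arch)."""
    root = ET.fromstring(xml_text)
    packages = root.findall("c:package", NS)
    packages.sort(
        key=lambda p: (p.findtext("c:name", "", NS), p.findtext("c:arch", "", NS))
    )
    lines = []
    for pkg in packages:
        text = ET.tostring(pkg, encoding="unicode")
        # Keep one stripped line per source line and skip blank lines.
        lines.extend(ln.strip() for ln in text.splitlines() if ln.strip())
    return lines


def metadata_diff(old_xml, new_xml):
    """Line-by-line unified diff (no context lines) of two sorted listings."""
    return list(
        difflib.unified_diff(
            sorted_package_lines(old_xml),
            sorted_package_lines(new_xml),
            "old",
            "new",
            n=0,
            lineterm="",
        )
    )
```

With both listings sorted the same way, a package that merely moved within the
file contributes nothing to the diff, so only real changes remain.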

Those two are by far the largest metadata files. If the observed
improvements are typical, then keeping those files in order and hosting
the diffs between the present and the previous few days (and modifying
dnf to look for those diffs) could substantially reduce the amount of
data that users must download every time a repository is updated, which
for a fast-moving OS like Fedora could happen nearly every day.

--
Daniel Mach <dmach@xxxxxxxxxx>
Release Engineering, Red Hat
--
devel mailing list
devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/devel
Fedora Code of Conduct: http://fedoraproject.org/code-of-conduct