sorting yum/dnf metadata and metadata diffs

Fri Feb 13 08:17:09 UTC 2015

Hi,
there is already some work in progress on this:
https://bugzilla.redhat.com/show_bug.cgi?id=850896

Proof-of-concept code (to be merged into dnf/createrepo_c in the future):
https://github.com/Tojaj/DeltaRepo

The idea is simple:
* create deltas as small repos on the server
* download the deltas on the client
* do an in-memory "mergerepo" on the client
   (or cache the result on disk if that makes sense)
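
To illustrate the client-side merge step, here is a minimal sketch. It is not
the actual DeltaRepo format or API. It assumes a delta can be reduced to a set
of added/updated packages plus a list of removed ones, keyed by (name, arch).
The function name, the key choice and the sample versions are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical model: a repo is a mapping from (name, arch) to a metadata record.
PkgKey = Tuple[str, str]
Repo = Dict[PkgKey, dict]


def apply_delta(base: Repo, added: Repo, removed: Iterable[PkgKey]) -> Repo:
    """Return a new repo: base minus removed packages, plus added/updated ones."""
    merged = dict(base)            # shallow copy, the cached base stays untouched
    for key in removed:
        merged.pop(key, None)      # ignore keys that are already gone
    merged.update(added)           # new packages and newer builds win
    return merged


# Made-up sample data, only for illustration.
base = {
    ("bash", "x86_64"): {"ver": "4.3.30"},
    ("zsh", "x86_64"): {"ver": "5.0.7"},
}
added = {
    ("bash", "x86_64"): {"ver": "4.3.33"},
    ("fish", "x86_64"): {"ver": "2.1.1"},
}
removed = [("zsh", "x86_64")]

merged = apply_delta(base, added, removed)
```

Because the merge only touches package records, the same logic would apply
whether the records were parsed from xml or read from sqlite.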

I consider this approach better than generating diffs,
mainly because it is simple and clean, and it works with any repo format (sqlite, xml, or a mix of both).

- daniel

On 13.2.2015 at 08:11, Casey Jao wrote:
> How feasible would it be to keep the listings in primary.xml and
> filelists.xml sorted by package name and arch? Doing so could open the
> door to simple and efficient diffs of repository metadata.
>
> I recently ran some quick tests using python and elementtree. While the
> F21 primary.xml files from 2/7 and 2/9 both weigh around 2.6M compressed
> and ~18M uncompressed, sorting them and running a simple line-by-line
> comparison revealed a diff of ~500K, which compressed down to ~70K. A
> similar procedure on the 8M filelists.xml yielded a diff which
> compressed to ~200K.
>
> Those two are by far the largest metadata files. If the observed
> improvements are typical, then keeping those files in order and hosting
> the diffs between the present and the previous few days (and modifying
> dnf to look for those diffs) could substantially reduce the amount of
> data that users must download every time a repository is updated, which
> for a fast-moving OS like Fedora could happen nearly every day.
>
>
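
For reference, the sort-and-diff experiment described in the quoted message
could look roughly like the sketch below. This is not the original test
script. It assumes each <package> element is serialized on its own line, and
the inline sample metadata (package versions included) is made up.

```python
import difflib
import xml.etree.ElementTree as ET

# Namespace used by the root and package elements of primary.xml.
NS = "http://linux.duke.edu/metadata/common"


def sorted_package_lines(xml_text):
    """Serialize <package> elements sorted by (name, arch), one per line."""
    root = ET.fromstring(xml_text)

    def key(pkg):
        return (pkg.findtext(f"{{{NS}}}name", ""),
                pkg.findtext(f"{{{NS}}}arch", ""))

    lines = []
    for pkg in sorted(root.findall(f"{{{NS}}}package"), key=key):
        lines.extend(ET.tostring(pkg, encoding="unicode").strip().splitlines())
    return lines


# Tiny made-up samples; note the package order differs between the two.
OLD = f"""<metadata xmlns="{NS}" packages="2">
<package type="rpm"><name>zsh</name><arch>x86_64</arch><version epoch="0" ver="5.0.7" rel="4.fc21"/></package>
<package type="rpm"><name>bash</name><arch>x86_64</arch><version epoch="0" ver="4.3.30" rel="2.fc21"/></package>
</metadata>"""

NEW = f"""<metadata xmlns="{NS}" packages="2">
<package type="rpm"><name>bash</name><arch>x86_64</arch><version epoch="0" ver="4.3.33" rel="2.fc21"/></package>
<package type="rpm"><name>zsh</name><arch>x86_64</arch><version epoch="0" ver="5.0.7" rel="4.fc21"/></package>
</metadata>"""

# Line-by-line diff with no context lines, so only real changes are emitted.
diff = list(difflib.unified_diff(sorted_package_lines(OLD),
                                 sorted_package_lines(NEW),
                                 lineterm="", n=0))

# Keep only added/removed lines, dropping the "---"/"+++" file headers.
changed = [line for line in diff
           if line[0] in "+-" and not line.startswith(("+++", "---"))]
```

Without the sort, the reordered zsh entry would show up in the diff even
though it did not change; with it, only the bash update remains.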

-- 
Daniel Mach <dmach at redhat.com>
Release Engineering, Red Hat