Re: A proof-of-concept for delta'ing repodata

Monday, 12 March 2018

Hi Jonathan,

To me, the zchunk idea looks good.

Incidentally, for the last couple of months, I have been trying to
rethink the way we cache metadata on the clients, as part of the
libdnf (re)design efforts. My goal was to de-duplicate the data
between similar repos in the cache as well as decrease the size that
needs to be downloaded every time (inevitably leading to this topic).

I came up with two different strategies:

1) Chunking

At first, I realized that there's a resemblance of the git data model
(a content-addressable file system) in our repodata.

Git has objects. They can either be blobs or trees. A tree is an index
of objects referred to by their hashes. In our domain, we have
repomd.xml (a tree) that refers to primary.xml and other files
(trees), which in turn refer (well, semantically at least) to
<package> snippets (blobs) and rpm files. What's different from git is
that our trees are xml files and we compress/combine some of them in a
single file (such as primary.xml). On the abstract level, though, the
concept is the same.

With this, you already get a pretty efficient way to distribute a
recursive data structure such as the repodata, if you can break it
down into objects wisely. It might not be super efficient, but it's
many times better than what we have now.
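
To make the analogy concrete, here is a minimal sketch of such an object store, assuming SHA-256 as the hash and plain dicts standing in for the server and the client cache. The helper names (put_blob, put_tree, sync) and the "kind, newline, payload" object layout are made up for illustration; they are not git's or libdnf's actual formats.

```python
import hashlib

def put(store, kind, payload):
    """Store an object under the SHA-256 of its content and return the hash."""
    data = kind.encode() + b"\n" + payload
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest

def put_blob(store, payload):
    return put(store, "blob", payload)

def put_tree(store, entries):
    """A tree is an index of child objects: one "name hash" line per entry."""
    lines = [f"{name} {digest}" for name, digest in sorted(entries.items())]
    return put(store, "tree", "\n".join(lines).encode())

def sync(remote, local, digest):
    """Fetch every object reachable from digest that local lacks.

    Assumes that if local has an object, it also has everything below it,
    so unchanged subtrees are skipped entirely. Returns the fetch count.
    """
    if digest in local:
        return 0
    data = remote[digest]
    local[digest] = data
    fetched = 1
    kind, _, payload = data.partition(b"\n")
    if kind == b"tree":
        for line in payload.decode().splitlines():
            _name, child = line.rsplit(" ", 1)
            fetched += sync(remote, local, child)
    return fetched

remote, local = {}, {}

# First snapshot: repomd (tree) -> primary (tree) -> two package blobs.
foo = put_blob(remote, b"<package>foo-1.0</package>")
bar = put_blob(remote, b"<package>bar-2.0</package>")
primary_v1 = put_tree(remote, {"foo": foo, "bar": bar})
root_v1 = put_tree(remote, {"primary": primary_v1})
first = sync(remote, local, root_v1)

# Second snapshot: only foo changes, bar's blob is reused as-is.
foo2 = put_blob(remote, b"<package>foo-1.1</package>")
primary_v2 = put_tree(remote, {"foo": foo2, "bar": bar})
root_v2 = put_tree(remote, {"primary": primary_v2})
second = sync(remote, local, root_v2)

print(first, second)  # 4 3
```

The second sync fetches the new root, the new primary tree and the one changed blob; the unchanged package is never downloaded again. Two repos sharing a local store would de-duplicate identical packages the same way.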

That made me think that either using git (libgit2) directly or doing a
small, lightweight implementation of the core concepts might be the
way to go. I even played with the latter a bit (I didn't get to
breaking down primary.xml, though):
https://github.com/dmnks/rhs-proto

In the context of this thread, this is basically what you do with
zchunk (just much better) :)

2) Deltas

Later, during this year's DevConf, I had a few "brainstorming"
sessions with Florian Festi, who pointed out that the differences in
metadata updates might often be on the sub-package level (e.g. NEVRA
in the version tag) so chunking on the package boundaries might not
give us the best results possible. Instead, perhaps we could generate
deltas on the binary level.

Git does implement object deltas (see packfiles). However, they
require the webserver to be "smart" while all we can afford in the
Fedora infrastructure are pure HTTP GET requests, so that's already a
no-go.

An alternative would be to pre-generate (compressed) binary deltas for
the last N versions and let clients download an index file that will
tell them what deltas they're missing and should download. This is
basically what Debian's pdiff format does. One downside to this
approach is that it doesn't give us the de-duplication on clients
consuming multiple repos with similar content (probably quite common
with RHEL subscriptions at least).
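
The client side of that scheme could look roughly like the sketch below. The index layout (a "latest" hash plus a list of "from"/"to"/"file" records) and the function name plan_update are assumptions made up for this example; this is not the actual pdiff index format.

```python
def plan_update(index, have):
    """Return the delta files to download, in order, to get from the
    snapshot we have to the latest one. Returns None when no chain of
    deltas exists (e.g. our copy is older than the last N versions), in
    which case the client would fall back to a full download.
    """
    by_source = {d["from"]: d for d in index["deltas"]}
    plan, current = [], have
    # Bounded loop so a malformed (cyclic) index cannot hang the client.
    for _ in range(len(index["deltas"]) + 1):
        if current == index["latest"]:
            return plan
        step = by_source.get(current)
        if step is None:
            return None
        plan.append(step["file"])
        current = step["to"]
    return None

# Hypothetical index covering the last three updates (a -> b -> c -> d).
index = {
    "latest": "d",
    "deltas": [
        {"from": "a", "to": "b", "file": "a-b.delta"},
        {"from": "b", "to": "c", "file": "b-c.delta"},
        {"from": "c", "to": "d", "file": "c-d.delta"},
    ],
}

print(plan_update(index, "b"))  # ['b-c.delta', 'c-d.delta']
print(plan_update(index, "z"))  # None
```

Everything here works with plain HTTP GETs: one for the index, then one per delta in the plan.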

Then I stumbled upon casync which combines the benefits of both
strategies; it chunks based on the shape of the data (arguably giving
better results than chunking on the package boundaries), and it
doesn't require a smart protocol. However, it involves a lot of HTTP
requests as you already mentioned.
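
For readers unfamiliar with chunking "based on the shape of the data", here is a toy sketch of content-defined chunking with a simple gear-style rolling hash. It is not casync's actual chunker; the table, the 32-bit hash width and the function name cdc_chunks are assumptions chosen only to keep the example short.

```python
import hashlib

# One pseudo-random 32-bit value per byte value, derived deterministically.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
    for i in range(256)
]

def cdc_chunks(data, bits=6):
    """Split data wherever the top `bits` bits of the rolling hash are zero.

    The hash is kept to 32 bits and shifted left by one per byte, so it
    only depends on the last 32 bytes. Cut points therefore follow the
    content, not the absolute offset. Average chunk size is about 2**bits.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - bits) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# 4 KiB of varied test data, then the same data with one byte prepended.
data = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
shifted = b"X" + data

original = cdc_chunks(data)
shared = set(original) & set(cdc_chunks(shifted))
print(len(original), "chunks,", len(shared), "reused after a 1-byte insert")
```

With fixed-size blocks, a one-byte insert at the front would shift every block and nothing could be reused. Here the cut points re-align shortly after the insert, so the chunks that follow it keep their hashes.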

Despite that, I'm still leaning towards chunking as being the better
solution of the two. The question is how much granularity we want.
You made a good point: the repodata format is fixed (be it xml or
solv), so we might as well take advantage of it to detect boundaries
for chunking, rather than using a rolling hash (but I have no data to
back it up). I'm not sure how to approach the many-GET-requests (or
the lack of range support) problem, though.
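
A rough sketch of format-aware chunking, assuming chunks are cut at <package> element boundaries in primary.xml and identified by SHA-256. The function names and the simplified XML are illustrative only; this is not how zchunk itself is implemented.

```python
import hashlib
import re

# Matches one whole <package ...>...</package> element (not <packager>).
PACKAGE = re.compile(rb"<package[\s>].*?</package>", re.DOTALL)

def chunk_on_packages(xml):
    """Split XML bytes into chunks: one per package element, plus whatever
    sits between them (header, trailer, whitespace) as separate chunks."""
    chunks, pos = [], 0
    for match in PACKAGE.finditer(xml):
        if match.start() > pos:
            chunks.append(xml[pos:match.start()])
        chunks.append(match.group())
        pos = match.end()
    if pos < len(xml):
        chunks.append(xml[pos:])
    return chunks

def missing_chunks(old_xml, new_xml):
    """Chunks of new_xml whose hash is not already present in old_xml."""
    have = {hashlib.sha256(c).hexdigest() for c in chunk_on_packages(old_xml)}
    return [
        c for c in chunk_on_packages(new_xml)
        if hashlib.sha256(c).hexdigest() not in have
    ]

def make_primary(foo_version):
    # Heavily simplified stand-in for a real primary.xml.
    return (
        b'<metadata packages="2">'
        b'<package type="rpm"><name>foo</name><version ver="'
        + foo_version +
        b'"/></package>'
        b'<package type="rpm"><name>bar</name><version ver="2.0"/></package>'
        b"</metadata>"
    )

old, new = make_primary(b"1.0"), make_primary(b"1.1")
needed = missing_chunks(old, new)
print(len(chunk_on_packages(new)), "chunks,", len(needed), "to download")
# 4 chunks, 1 to download
```

Note that the one chunk to download is the entire foo element even though only its version attribute changed, which is the sub-package granularity concern raised above.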

As part of my efforts, I created this "small" git repo that contains
metadata snapshots since ~February, which can be useful to see what
typical metadata updates look like. Feel free to use it (e.g. for
testing out zchunk):
https://pagure.io/mdhist

Thanks,

Michal
