Request for Comments: updating RPMs using binary deltas.
Lamar Owen
lowen at pari.edu
Thu Jan 8 15:33:01 UTC 2004
Over the past few days on the fedora list, I have participated in a discussion
about 'why can't we distribute patches instead of multimegabyte RPMs for
little changes'.
One of the other participants suggested that this discussion would be better
suited to the -devel list, and I agree.
To summarize:
1.) A utility for distributing RPM 'deltas' already exists by the name of
rhmask.
2.) Rhmask files may not be optimized for size, since rhmask was designed for
a different problem (how to distribute patched RPMs for non-open-source
packages: not an issue now, and in fact rhmask is no longer distributed).
3.) A _similar_ thing could be done to reduce both the server storage space
required for updates as well as the download bandwidth for the updates. In
the case of kernel updates, the bandwidth costs have to be huge, particularly
on the server side.
What I am proposing:
1.) Use rsync or something similar to generate an incremental backup of the
patched, unpacked RPM versus the original distributed RPM (it is not widely
known how to do this, but rsync is capable of copying only the files that
have changed: see the O'Reilly book 'Linux Server Hacks', hack #42). The
delta must use the original, pristine, as-distributed RPM as the baseline,
or this becomes unwieldy. And we would want the actual file deltas, not the
whole changed files that an rsync incremental backup would provide.
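As a rough sketch of this step (nothing here is rpm's or rsync's real
machinery; the delta format and function names are invented for
illustration), generating file-by-file deltas between the pristine unpacked
tree and the patched one could look like this, with a naive
common-prefix/common-suffix delta standing in for a real binary-diff
algorithm:

```python
import os

def file_delta(old: bytes, new: bytes):
    """Naive binary delta: store only the bytes of 'new' that differ,
    as (prefix_len, suffix_len, midsection). A real tool would use a
    proper binary-diff algorithm; this is just to show the idea."""
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1
    s = 0
    while (s < min(len(old), len(new)) - p
           and old[len(old) - 1 - s] == new[len(new) - 1 - s]):
        s += 1
    return (p, s, new[p:len(new) - s])

def tree_delta(old_root: str, new_root: str):
    """Walk the patched tree; emit a delta entry for every file that
    differs from its counterpart in the pristine tree."""
    deltas = {}
    for dirpath, _dirs, files in os.walk(new_root):
        for name in files:
            new_path = os.path.join(dirpath, name)
            rel = os.path.relpath(new_path, new_root)
            old_path = os.path.join(old_root, rel)
            with open(new_path, 'rb') as f:
                new_data = f.read()
            old_data = b''
            if os.path.exists(old_path):
                with open(old_path, 'rb') as f:
                    old_data = f.read()
            if old_data != new_data:
                deltas[rel] = file_delta(old_data, new_data)
    return deltas
```

(rsync's own delta-transfer algorithm works on rolling block checksums; the
point here is only that the rpmdiff would carry per-file deltas rather than
whole files.)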
2.) Package and sign this file, calling it an 'rpmdiff' or something, using
the same basic rpm header structure. This rpmdiff file also carries all the
header info that a traditional errata RPM would carry, including the entire
file list, but would not be directly installable by RPM (unless RPM is
modified to use deltas, but that may be unwieldy).
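The container might be bundled something like the sketch below. The field
names are invented for illustration and only loosely mirror the kind of
metadata an errata RPM header carries; this is not rpm's real header layout,
and a plain SHA-1 digest stands in for the GPG signature a real tool would
attach:

```python
import hashlib
import json

def make_rpmdiff(header: dict, file_manifest: list, deltas: dict) -> bytes:
    """Bundle the errata header info, the complete file list, and the
    per-file (prefix_len, suffix_len, midsection) deltas into one blob.
    Serialized as JSON with an appended SHA-1 digest as a stand-in
    integrity check (a real rpmdiff would reuse rpm's header structure
    and carry a GPG signature)."""
    body = json.dumps({
        "header": header,        # name, epoch, version, release, etc.
        "files": file_manifest,  # entire file list of the new RPM
        "deltas": {path: [p, s, mid.hex()]
                   for path, (p, s, mid) in deltas.items()},
    }, sort_keys=True).encode()
    digest = hashlib.sha1(body).hexdigest().encode()
    return body + b"\n" + digest

def check_rpmdiff(blob: bytes) -> dict:
    """Verify the stand-in digest and return the decoded payload."""
    body, digest = blob.rsplit(b"\n", 1)
    if hashlib.sha1(body).hexdigest().encode() != digest:
        raise ValueError("rpmdiff integrity check failed")
    return json.loads(body)
```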
3.) Locate the pristine RPM. The reason you can't just install the rpmdiff
and fake out the RPM database as to what files were actually installed
follows from the very case this technique benefits most: when just a handful
of files change in, say, the kernel-source RPM, the new kernel-source RPM is
installed into a differently versioned directory tree altogether, so the
installed files are no baseline to patch against. Locating the pristine RPM
would probably be an 'insert CD #x' type of thing, or providing a URI to the
RPM, whether locally stored or on an NFS, FTP, or HTTP server.
We don't want to require the user to have all their RPMs stored in a local
disk repository (although that is another thing I'd like to see: the option
of having the installation process store the RPM files themselves in a local
repository that the install tools can then use, but that's probably not in
line with what most people would want).
4.) The pristine RPM is then 'patched' by the rpmdiff on a file-by-file
basis. The headers from the rpmdiff are used to build the resulting complete
RPM, which should be identical to the full errata RPM that would have been
downloaded (except for the signature: the rpmdiff would be signed and
checked by the update tools, whereas the reconstructed full RPM would not be
signed unless the full RPM's signature is transmitted as part of the
rpmdiff). The key to saving space and bandwidth is the use of file-by-file
deltas.
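Reconstruction is the inverse of delta generation. Assuming a naive
(prefix_len, suffix_len, midsection) per-file delta format (a hypothetical
stand-in for a real binary-diff algorithm), patching and verifying against
per-file digests carried in the rpmdiff might be sketched as:

```python
import hashlib

def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new file contents from the pristine file plus a
    (prefix_len, suffix_len, midsection) delta."""
    p, s, mid = delta
    return old[:p] + mid + old[len(old) - s:]

def reconstruct(pristine_files: dict, deltas: dict, expected: dict) -> dict:
    """Patch every changed file, then verify each result against the
    digest recorded for the full errata RPM's file list, so the
    rebuilt package is byte-identical to the one we would otherwise
    have downloaded."""
    rebuilt = dict(pristine_files)
    for path, delta in deltas.items():
        rebuilt[path] = apply_delta(pristine_files.get(path, b""), delta)
    for path, digest in expected.items():
        if hashlib.sha1(rebuilt[path]).hexdigest() != digest:
            raise ValueError("reconstruction mismatch for " + path)
    return rebuilt
```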
5.) The resulting reconstructed errata RPM is then installed normally, using
up2date or whatever.
6.) There would need to be both a command-line 'rpmdiff' tool as well as
up2date support for this to work, unless the support could be rolled into
rpmlib itself.
7.) The user then enjoys being able to download the updates over dialup and
having a chance to finish in less than five hours, in the case of the
kernel-source update. Dialup users still exist, and they are not going away.
The minor inconvenience of having to insert their original CDs, or have all
the original RPMs stored someplace, IMO outweighs the much larger
inconvenience and cost of the bandwidth problems. If the updates are large
enough one could get compromised in the time it takes to download them. Even
with a high-bandwidth pipe, like a T1 or good DSL, the larger updates,
especially during the testing phases, can take hours to download. I remember
getting beta updates through up2date that took a very long time to download
(and even longer to install, since virtually every package in the whole
distribution had changed, many of them by only a few bytes).
8.) The updates repository enjoys being able to service many more users per
hour, since each user takes less time and less bandwidth. And hundreds of GB
are no longer required for a full mirror of all the updates.
9.) Updates get applied quicker, since they take less time. (that would be a
yogi-ism....)
So we trade off the CPU and disk of doing the delta versus the bandwidth of
the download.
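To put rough numbers on that trade-off (all figures here are illustrative
assumptions, not measurements): a hypothetical 30 MB kernel-source errata
over a 56k modem, versus a delta carrying, say, 2% of the bytes:

```python
def download_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Transfer time in hours, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_sec / 3600

full_rpm = 30 * 1024 * 1024  # assumed 30 MB errata RPM
delta = full_rpm * 0.02      # assume the delta is 2% of the full size
modem = 56_000               # 56 kbit/s dialup, best case

print(f"full RPM: {download_hours(full_rpm, modem):.1f} h")
print(f"rpmdiff:  {download_hours(delta, modem):.2f} h")
```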
Comments?
--
Lamar Owen
Director of Information Technology
Pisgah Astronomical Research Institute
1 PARI Drive
Rosman, NC 28772
(828)862-5554
www.pari.edu