Request for Comments: updating RPMs using binary deltas.
Lamar Owen
lowen at pari.edu
Thu Jan 8 15:33:01 UTC 2004
Over the past few days on the fedora list, I have participated in a discussion
about 'why can't we distribute patches instead of multimegabyte RPMs for
little changes'.
One of the other participants suggested that this discussion would be better
suited to the -devel list, and I agree.
To summarize:
1.) A utility for distributing RPM 'deltas' already exists by the name of
rhmask.
2.) Rhmask files may not be optimized for size, since rhmask was designed for
a different problem (how to distribute patched RPMs for non-open-source
packages: not an issue now, and in fact rhmask is no longer distributed).
3.) A _similar_ thing could be done to reduce both the server storage space
required for updates as well as the download bandwidth for the updates. In
the case of kernel updates, the bandwidth costs have to be huge, particularly
on the server side.
What I am proposing:
1.) Use rsync or something similar to generate an incremental backup of the
patched, unpacked RPM versus the original distributed RPM (it is not widely
known how to do this, but rsync is capable of copying only the files that
have changed: see the O'Reilly book 'Linux Server Hacks', hack #42). The
delta must use the original, pristine, as-distributed RPM as the baseline,
or this becomes unwieldy. And we would want the actual file deltas, not the
whole changed files that an rsync incremental backup would provide.
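As a rough sketch of this step (nothing here is rpm's or rsync's real
machinery; the delta format and function names are invented for
illustration), generating file-by-file deltas between the pristine unpacked
tree and the patched one could look like this, with a naive
common-prefix/common-suffix delta standing in for a real binary-diff
algorithm:

```python
import os

def file_delta(old: bytes, new: bytes):
    """Naive binary delta: store only the bytes of 'new' that differ,
    as (prefix_len, suffix_len, midsection). A real tool would use a
    proper binary-diff algorithm; this is just to show the idea."""
    p = 0
    while p < min(len(old), len(new)) and old[p] == new[p]:
        p += 1
    s = 0
    while (s < min(len(old), len(new)) - p
           and old[len(old) - 1 - s] == new[len(new) - 1 - s]):
        s += 1
    return (p, s, new[p:len(new) - s])

def tree_delta(old_root: str, new_root: str):
    """Walk the patched tree; emit a delta entry for every file that
    differs from its counterpart in the pristine tree."""
    deltas = {}
    for dirpath, _dirs, files in os.walk(new_root):
        for name in files:
            new_path = os.path.join(dirpath, name)
            rel = os.path.relpath(new_path, new_root)
            old_path = os.path.join(old_root, rel)
            with open(new_path, 'rb') as f:
                new_data = f.read()
            old_data = b''
            if os.path.exists(old_path):
                with open(old_path, 'rb') as f:
                    old_data = f.read()
            if old_data != new_data:
                deltas[rel] = file_delta(old_data, new_data)
    return deltas
```

(rsync's own delta-transfer algorithm works on rolling block checksums; the
point here is only that the rpmdiff would carry per-file deltas rather than
whole files.)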
2.) Package and sign this file, calling it an 'rpmdiff' or something, using
the same basic rpm header structure. This rpmdiff file also carries all the
header info that a traditional errata RPM would carry, including the entire
file list, but would not be directly installable by RPM (unless RPM is
modified to use deltas, but that may be unwieldy).
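The container might be bundled something like the sketch below. The field
names are invented for illustration and only loosely mirror the kind of
metadata an errata RPM header carries; this is not rpm's real header layout,
and a plain SHA-1 digest stands in for the GPG signature a real tool would
attach:

```python
import hashlib
import json

def make_rpmdiff(header: dict, file_manifest: list, deltas: dict) -> bytes:
    """Bundle the errata header info, the complete file list, and the
    per-file (prefix_len, suffix_len, midsection) deltas into one blob.
    Serialized as JSON with an appended SHA-1 digest as a stand-in
    integrity check (a real rpmdiff would reuse rpm's header structure
    and carry a GPG signature)."""
    body = json.dumps({
        "header": header,        # name, epoch, version, release, etc.
        "files": file_manifest,  # entire file list of the new RPM
        "deltas": {path: [p, s, mid.hex()]
                   for path, (p, s, mid) in deltas.items()},
    }, sort_keys=True).encode()
    digest = hashlib.sha1(body).hexdigest().encode()
    return body + b"\n" + digest

def check_rpmdiff(blob: bytes) -> dict:
    """Verify the stand-in digest and return the decoded payload."""
    body, digest = blob.rsplit(b"\n", 1)
    if hashlib.sha1(body).hexdigest().encode() != digest:
        raise ValueError("rpmdiff integrity check failed")
    return json.loads(body)
```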
3.) Locate the pristine RPM. The reason you can't just install the rpmdiff
and fake out the RPM database as to what files were actually installed
follows from the very case this technique benefits most: when just a handful
of files change in, say, the kernel-source RPM, the new kernel-source RPM is
installed into a differently versioned directory tree altogether, so the
installed files are no baseline to patch against. Locating the pristine RPM
would probably be an 'insert CD #x' type of thing, or providing a URI to the
RPM, whether locally stored or on an NFS, FTP, or HTTP server.
We don't want to require the user to have all their RPMs stored in a local
disk repository (although that is another thing I'd like to see: the option
of having the installation process store the RPM files themselves in a local
repository that the install tools can then use, but that's probably not in
line with what most people would want).
4.) The pristine RPM is then 'patched' by the rpmdiff on a file-by-file
basis. The headers from the rpmdiff are used to build the resulting complete
RPM, which should be identical to the full errata RPM that would have been
downloaded (except for the signature: the rpmdiff would be signed and
checked by the update tools, whereas the reconstructed full RPM would not be
signed unless the full RPM's signature is transmitted as part of the
rpmdiff). The key to saving space and bandwidth is the use of file-by-file
deltas.
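Reconstruction is the inverse of delta generation. Assuming a naive
(prefix_len, suffix_len, midsection) per-file delta format (a hypothetical
stand-in for a real binary-diff algorithm), patching and verifying against
per-file digests carried in the rpmdiff might be sketched as:

```python
import hashlib

def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new file contents from the pristine file plus a
    (prefix_len, suffix_len, midsection) delta."""
    p, s, mid = delta
    return old[:p] + mid + old[len(old) - s:]

def reconstruct(pristine_files: dict, deltas: dict, expected: dict) -> dict:
    """Patch every changed file, then verify each result against the
    digest recorded for the full errata RPM's file list, so the
    rebuilt package is byte-identical to the one we would otherwise
    have downloaded."""
    rebuilt = dict(pristine_files)
    for path, delta in deltas.items():
        rebuilt[path] = apply_delta(pristine_files.get(path, b""), delta)
    for path, digest in expected.items():
        if hashlib.sha1(rebuilt[path]).hexdigest() != digest:
            raise ValueError("reconstruction mismatch for " + path)
    return rebuilt
```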
5.) The resulting reconstructed errata RPM is then installed normally, using
up2date or whatever.
6.) There would need to be both a command-line 'rpmdiff' tool as well as
up2date support for this to work, unless the support could be rolled into
rpmlib itself.
7.) The user then enjoys being able to download the updates over dialup and
having a chance to finish in less than five hours, in the case of the
kernel-source update. Dialup users still exist, and they are not going away.
The minor inconvenience of having to insert their original CDs, or have all
the original RPMs stored someplace, IMO outweighs the much larger
inconvenience and cost of the bandwidth problems. If the updates are large
enough one could get compromised in the time it takes to download them. Even
with a high-bandwidth pipe, like a T1 or good DSL, the larger updates,
especially during the testing phases, can take hours to download. I remember
getting beta updates through up2date that took a very long time to download
(and even longer to install, since virtually every package in the whole
distribution had changed, many of them by only a few bytes).
8.) The updates repository enjoys being able to service many more users per
hour, since each user takes less time and less bandwidth. And hundreds of GB
are no longer required for a full mirror of all the updates.
9.) Updates get applied quicker, since they take less time. (that would be a
yogi-ism....)
So we trade off the CPU and disk of doing the delta versus the bandwidth of
the download.
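To put rough numbers on that trade-off (all figures here are illustrative
assumptions, not measurements): a hypothetical 30 MB kernel-source errata
over a 56k modem, versus a delta carrying, say, 2% of the bytes:

```python
def download_hours(size_bytes: float, link_bits_per_sec: float) -> float:
    """Transfer time in hours, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_sec / 3600

full_rpm = 30 * 1024 * 1024  # assumed 30 MB errata RPM
delta = full_rpm * 0.02      # assume the delta is 2% of the full size
modem = 56_000               # 56 kbit/s dialup, best case

print(f"full RPM: {download_hours(full_rpm, modem):.1f} h")
print(f"rpmdiff:  {download_hours(delta, modem):.2f} h")
```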
Comments?
--
Lamar Owen
Director of Information Technology
Pisgah Astronomical Research Institute
1 PARI Drive
Rosman, NC 28772
(828)862-5554
www.pari.edu