Here are a couple of short demo scripts for generating and applying deltas between two rpms. The key step is to use the minigzip program from the zlib source (see the docs directory of the zlib-devel package), as minigzip uses compression compatible with the compression rpm uses. If I were writing this properly, I would consider using python, which seems to have a zlib module, and could also hook into some rpm libraries.
Michael Young
On Sunday 25 January 2004 16:49, M A Young wrote:
Here are a couple of short demo scripts for generating and applying deltas between two rpms. The key step is to use the minigzip program from the zlib source (see the docs directory of the zlib-devel package), as minigzip uses compression compatible with the compression rpm uses. If I were writing this properly, I would consider using python, which seems to have a zlib module, and could also hook into some rpm libraries.
Michael Young
makerpmdelta.sh script
--- cut ---
rpm2cpio <$FILE1 >$FILE1.cpio
rpm2cpio <$FILE2 >$FILE2.cpio
--- cut ---
This way you lose the pre/post install/uninstall scripts, AFAIK. Also, what about signed packages? I'm not saying that the lack of an xdelta/patch-like tool for rpms is a good thing, but it looks like it's not that easy... Take a look in the list archives; there was a discussion about this problem.
makerpmdelta.sh script
--- cut ---
rpm2cpio <$FILE1 >$FILE1.cpio
rpm2cpio <$FILE2 >$FILE2.cpio
--- cut ---
This way you lose the pre/post install/uninstall scripts, AFAIK. Also, what about signed packages? I'm not saying that the lack of an xdelta/patch-like tool for rpms is a good thing, but it looks like it's not that easy...
Read the whole script, or try it! It is a working demo, and the delta files produced are of about the size you would get for a properly implemented rpm delta. It just isn't user friendly.
Michael Young
On Mon, 2004-01-26 at 05:47, Doncho N. Gunchev wrote:
makerpmdelta.sh script
--- cut ---
rpm2cpio <$FILE1 >$FILE1.cpio
rpm2cpio <$FILE2 >$FILE2.cpio
--- cut ---
This way you lose the pre/post install/uninstall scripts, AFAIK. Also, what about signed packages? I'm not saying that the lack of an xdelta/patch-like tool for rpms is a good thing, but it looks like it's not that easy... Take a look in the list archives; there was a discussion about this problem.
I've tested this on a few rpms I have lying around and it handles signatures and scripts so far.
I think this part later in the makerpmdelta.sh script takes care of the non-cpio parts:
# Bundle the 1st rpm and the 2nd cpio together (we really only need the
# non-archive bits of the first rpm)
cat $FILE1 $FILE2.cpio.gz >$FILE1.tmp
xdelta delta -p $FILE1.tmp $FILE2 $FILE1.delta
As a proof of concept this works so far for me. Unless someone else has already done this, I think I'll test it against the updates-released repository and see if I can generate any errors.
-Toshio
On Mon, 2004-01-26 at 10:13, Toshio wrote:
As a proof of concept this works so far for me. Unless someone else has already done this, I think I'll test it against the updates-released repository and see if I can generate any errors.
No errors from testing updates-released. All signatures verify and packages intact. List of space savings at the bottom of this message. The range is 19.5% - 99.2% savings with an average of 63.9%. I think more than halving the bandwidth is an achievement. However, you might notice that the kernel-source package is missing from this list. I cancelled the xdelta after a couple hours of running. I think the limiting factor on my machine was memory.
I looked into python as an eventual language for this and found that it has rpm bindings and zlib bindings but no built-in binary diff/xdelta support. There is an add-on module for librsync, http://freshmeat.net/projects/pysync/, that may fit the bill, though.
I've taken a look at SuSE's website, and it seems the use case for their patch RPMs is to distribute them as an alternative to full update RPMs against their base release.
kernel: 43.94% of original: 56.06% savings
lftp: 45.13% of original: 54.87% savings
quagga-devel: 12.10% of original: 87.90% savings
ethereal: 71.32% of original: 28.68% savings
kernel-doc: 0.93% of original: 99.07% savings
redhat-config-printer: 0.80% of original: 99.20% savings
redhat-config-printer-gui: 2.23% of original: 97.77% savings
xboard: 66.87% of original: 33.13% savings
quagga: 36.31% of original: 63.69% savings
ethereal-gnome: 80.54% of original: 19.46% savings
gaim: 40.51% of original: 59.49% savings
mod_ssl: 31.90% of original: 68.10% savings
bash: 8.84% of original: 91.16% savings
rsync: 39.80% of original: 60.20% savings
grep: 12.05% of original: 87.95% savings
net-snmp: 48.08% of original: 51.92% savings
quagga-contrib: 14.49% of original: 85.51% savings
gnucash: 9.24% of original: 90.76% savings
dia: 63.72% of original: 36.28% savings
sed: 10.25% of original: 89.75% savings
gnupg: 43.28% of original: 56.72% savings
gnucash-backend-postgres: 1.40% of original: 98.60% savings
procps: 9.27% of original: 90.73% savings
binutils: 3.64% of original: 96.36% savings

Total RPM Size: 45140838
Total Delta Size: 16309380
Total Savings: 28831458
63.87% savings

--
Toshio
toshio@tiki-lounge.com
Hello Toshio, Michael,
binutils: 3.64% of original: 96.36% savings
I seem to be unable to patch with just the .cpio.delta. I need the .delta to make this work. What is the saving here? And why do I need minigzip instead of gzip?
Bye, Leonard.
On Mon, 2004-01-26 at 18:18, Leonard den Ottolander wrote:
Hello Toshio, Michael,
binutils: 3.64% of original: 96.36% savings
I seem to be unable to patch with just the .cpio.delta. I need the .delta to make this work. What is the saving here? And why do I need minigzip instead of gzip?
The list of savings I compiled included both the .cpio.delta and the .delta so they should be correct.
100*(cpioDeltaSize+headerDeltaSize)/rpmSize = % of original
100 - % of original = percent savings
The .cpio.delta is the delta of the rpm payload. The .delta is the delta of the header information.
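The calculation described above can be sketched in Python roughly as follows. This is only an illustration and not the script attached to the original mail; the function name and the example sizes are made up for the purpose.

```python
def delta_savings(cpio_delta_size, header_delta_size, rpm_size):
    """Return (percent_of_original, percent_savings) for one package.

    cpio_delta_size   -- size in bytes of the payload delta (.cpio.delta)
    header_delta_size -- size in bytes of the header delta (.delta)
    rpm_size          -- size in bytes of the full updated rpm
    """
    if rpm_size <= 0:
        raise ValueError("rpm_size must be positive")
    percent_of_original = 100.0 * (cpio_delta_size + header_delta_size) / rpm_size
    return percent_of_original, 100.0 - percent_of_original


# Hypothetical sizes in bytes, chosen only to keep the arithmetic easy to follow.
pct, saved = delta_savings(cpio_delta_size=30_000,
                           header_delta_size=10_000,
                           rpm_size=100_000)
print(f"{pct:.2f}% of original: {saved:.2f}% savings")
```

Both delta files count towards the download, which is why they are summed before dividing by the size of the full rpm.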
rpm uses zlib to compress its payload. This does not create the same compressed data as gzip. minigzip, which links to zlib to get its compression routines, can generate files that are the same as those created by rpm. This matters because the compressed payload has been gpg-signed. If the payload is different, the reconstructed rpm won't have a valid signature.
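The property that matters here, that unpacking the payload and repacking it through zlib gives back the same bytes, can be tried out with Python's zlib module. This is a sketch under assumptions: the compression level (9) and the stand-in data are made up, and it says nothing about whether a particular zlib version reproduces what a given rpm build wrote.

```python
import zlib


def gzip_stream(data, level=9):
    """Compress data into a gzip-format stream using zlib directly."""
    # wbits=31 asks zlib for a gzip wrapper. zlib writes a minimal header
    # with no file name and a zero timestamp, so the output depends only
    # on the input bytes and the compression settings.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    return compressor.compress(data) + compressor.flush()


payload = b"pretend this is a cpio archive\n" * 1000   # stand-in data
original = gzip_stream(payload)

# Unpack and repack, as a delta tool would have to do on the client.
unpacked = zlib.decompress(original, 31)
repacked = gzip_stream(unpacked)

print(repacked == original)
```

If the repacked stream differed by even one byte, the digest and signature stored in the rpm header would no longer match the reconstructed file.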
-Toshio
P.S. my short python hack to compute the savings on files is attached in case you want to check my methodology :-)
On Mon, 2004-01-26 at 21:15, Toshio wrote:
No errors from testing updates-released. All signatures verify and packages intact. List of space savings at the bottom of this message. The range is 19.5% - 99.2% savings with an average of 63.9%. I think more than halving the bandwidth is an achievement.
Something else I noticed is that bigger packages seem to have far worse space savings than smaller ones, which don't really matter that much.
Your average is misleading since it doesn't take package size into account, just the % of savings for each package.
99.07% in kernel-doc is really not impressive, and if you look at all those near the 90% mark... well, they're really not impressive.
Rui
Hello Rui,
Your average is misleading since it doesn't take package size into account, just the % of savings for each package.
Really? Then what does
Total RPM Size: 45140838 Total Delta Size: 16309380 Total Savings: 28831458 63.87% savings
mean?
well, they're really not impressive.
I would say that is a matter of opinion, and not of much interest. What matters is if such an implementation would generate real world bandwidth savings for people that can use those.
Bye, Leonard.
On Tue, 2004-01-27 at 11:59, Leonard den Ottolander wrote:
Your average is misleading since it doesn't take package size into account, just the % of savings for each package.
Really? Then what does Total RPM Size: 45140838 Total Delta Size: 16309380 Total Savings: 28831458 63.87% savings mean?
I was misled by the term average, and thought the 63.87% was an average of the savings per package (which is in fact circa 70% and quite unreal). OK, 64% is a big reduction indeed.
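The difference between the two kinds of average is easy to see in a few lines of Python. This is a sketch with made-up package sizes (per-package sizes were not posted); only the totals in the last call come from the list earlier in the thread.

```python
def mean_of_percentages(packages):
    """Unweighted average of per-package savings; ignores package size."""
    return sum(100.0 * (rpm - delta) / rpm for rpm, delta in packages) / len(packages)


def overall_savings(packages):
    """Savings computed from total bytes, so big packages count for more."""
    total_rpm = sum(rpm for rpm, _ in packages)
    total_delta = sum(delta for _, delta in packages)
    return 100.0 * (total_rpm - total_delta) / total_rpm


# Hypothetical (rpm_size, delta_size) pairs in bytes: one tiny package that
# deltas very well and one large package that only halves.
packages = [(1_000, 10), (1_000_000, 500_000)]
print(f"mean of percentages: {mean_of_percentages(packages):.2f}%")
print(f"overall savings:     {overall_savings(packages):.2f}%")

# The posted totals, treated as a single entry.
print(f"posted totals:       {overall_savings([(45140838, 16309380)]):.2f}%")
```

The small package pulls the unweighted mean up, while the byte-weighted figure stays close to the savings on the large one.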
well, they're really not impressive.
I would say that is a matter of opinion, and not of much interest. What matters is if such an implementation would generate real world bandwidth savings for people that can use those.
It must be proved flawless (preferably with more than an empirical test throughout the historical updates, a real mathematical proof :)).
There's also the problem of the delta packages download server. Will it have to waste disk space by maintaining all deltas? I don't know how many mirrors will like that.
Rui
For those folks that want to eventually have optional RPM diffs for upgrades, please do not continue the discussion here. Instead if you feel so strongly that it is technically good and without drawbacks, then implement everything needed and provide a test repository and tools at your project site. Only after you feel your project page, tools, and repository are PERFECT, then announce here for testing and comments.
I would suggest that you research the problem that was discussed by Debian and Red Hat for at least the past 5 years. Many smart people have looked at this problem. Try not to repeat the same mistakes and learn from their prior discussions.
Do not discuss it here because generally the elders are totally not convinced that it is a good thing to do. I personally think it is possible and good to do as an OPTIONAL thing in cases where the diff is below a certain % of the total package size, like 10% for example. However I feel that we have more important things to work on that are higher priority, like Fedora Project's infrastructure, so I personally won't put any effort into this for at least a year or more.
Warren
On Tue, 2004-01-27 at 08:24, Warren Togami wrote:
For those folks that want to eventually have optional RPM diffs for upgrades, please do not continue the discussion here. Instead if you
More important than "we don't want to hear that kind of stuff here" (sorry Warren, but that's how it sounds), this subject is off-topic on the list. It's not about Fedora, but about RPM, so it might be on-topic for rpm-list, or maybe even warrants a mailing list of its own. Be aware that some of the aforementioned "elders" are on rpm-list, too ;-).
Nils
Hello Warren,
For those folks that want to eventually have optional RPM diffs for upgrades, please do not continue the discussion here.
This is the Fedora devel list, right? Maybe more on topic on the RPM list, but still a development issue, so IMHO rather on topic in this forum.
I haven't, for example, seen people talking about kernel issues being referred to kernel.org because their discussions would be more on topic there. If you are not interested in the issue, don't read the threads.
Instead if you feel so strongly that it is technically good and without drawbacks, then implement everything needed and provide a test repository and tools at your project site. Only after you feel your project page, tools, and repository are PERFECT, then announce here for testing and comments.
Some discussion before implementation can help tackle some issues beforehand. I thought xdeltas would be very inefficient, but these scriptlets Mike posted here look very promising.
I would suggest that you research the problem that was discussed by Debian and Red Hat for at least the past 5 years. Many smart people have looked at this problem. Try not to repeat the same mistakes and learn from their prior discussions.
That is very wise advice. But for some reason some of the objections I could formulate actually don't seem to be an issue. Now please tell us where these scriptlets go wrong: what mistakes are being made that were already identified before?
Do not discuss it here because generally the elders are totally not convinced that it is a good thing to do.
If the "elders" don't want to listen they don't listen. As long as no huge flame wars break out I don't think these lists need this kind of moderation/censorship.
I personally think it is possible and good to do as an OPTIONAL thing in cases where the diff is below a certain % of the total package size, like 10% for example.
Optional for sure. I don't think anybody that likes the idea of binary patches (in whatever form) wants these to be the default update path. Don't change the default, just add an option for people that can use such bandwidth savings (the other 99% of the population).
However I feel that we have more important things to work on that are higher priority, like Fedora Project's infrastructure, so I personally won't put any effort into this for at least a year or more.
Nobody is asking you to put effort into this, but while I wait for the above and other issues to be solved I hope you don't mind me investing my time in things that I think are worth investigating.
Bye, Leonard.
Hi Leonard,
On Tue, 2004-01-27 at 12:43, Leonard den Ottolander wrote:
For those folks that want to eventually have optional RPM diffs for upgrades, please do not continue the discussion here.
This is the Fedora devel list, right? Maybe more on topic on the RPM list, but still a development issue, so IMHO rather on topic in this forum.
I haven't, for example, seen people talking about kernel issues being referred to kernel.org because their discussions would be more on topic there. If you are not interested in the issue, don't read the threads.
While this is a development issue, an RPM diffing tool is in no way Fedora specific (I hope). Kernel related discussion here seems (to me) concentrated on kernel-on-FC or FC-kernels, luckily we don't have the huge mess that is called lkml here. I think an RPM diffing tool would be most on-topic on rpm-list (out of the pool of existing mailing lists), what'd be on-topic here would be (when it's ready) discussing eventual inclusion in FC, whether or not it should be used for distribution of FC packages, etc...
IMO, Nils
Hi,
On Tue, Jan 27, 2004 at 12:57:41PM +0100, Nils Philippsen wrote:
On Tue, 2004-01-27 at 12:43, Leonard den Ottolander wrote:
For those folks that want to eventually have optional RPM diffs for upgrades, please do not continue the discussion here.
This is the Fedora devel list, right? Maybe more on topic on the RPM list, but still a development issue, so IMHO rather on topic in this forum.
I haven't, for example, seen people talking about kernel issues being referred to kernel.org because their discussions would be more on topic there. If you are not interested in the issue, don't read the threads.
While this is a development issue, an RPM diffing tool is in no way Fedora specific (I hope). Kernel related discussion here seems (to me) concentrated on kernel-on-FC or FC-kernels, luckily we don't have the huge mess that is called lkml here. I think an RPM diffing tool would be most on-topic on rpm-list (out of the pool of existing mailing lists), what'd be on-topic here would be (when it's ready) discussing eventual inclusion in FC, whether or not it should be used for distribution of FC packages, etc...
I haven't seen such a fruitful discussion on this theme in a long time, and I enjoy the ideas and actual work being done on this. The thread is only a couple of mails, it doesn't disturb anyone, and it could lead to something very useful IMHO.
To get to the topic: there have been some comments on SuSE's way of doing this, and on why it is not accepted in rpm mainstream, at the packaging list, which unfortunately seems to have died a quiet death in May:
http://mail.freestandards.org/pipermail/packaging/2003-March/000214.html
It shows that solutions are crafted within distributions and then presented upstream (and are being rejected, but that should not scare anyone off). It also contains valuable hints as to what kind of solutions rpm could accept upstream.
The rpm-list is also more a user list than a development list. The interesting development "discussions" about rpm can usually be found at bugzilla.redhat.com. Nevertheless posting a summary mail at rpm-list about the current status would not be wrong.
On Tue, 2004-01-27 at 07:13, Axel Thimm wrote:
To get to the topic: there have been some comments on SuSE's way of doing this, and on why it is not accepted in rpm mainstream, at the packaging list, which unfortunately seems to have died a quiet death in May:
http://mail.freestandards.org/pipermail/packaging/2003-March/000214.html
Axel, that's an informative link! My reading of it is that jbj thinks deltas of rpms belong in the distribution rather than the package manager :-)
He better defines my vague misgivings about the SuSE patch method and presents rsync as a transport method as an alternative (which people here have shot down as being too server intensive).
I think that the xdelta method bridges these two approaches. It is done at build time rather than at download so it doesn't bog down the download server. It creates a valid rpm which is then installed by the package manager so there doesn't need to be a separate install path through rpm.
Reasons it should be discussed on rpm-list:
* Rpm could be modified to delta and patch rpms rather than an external tool

Reasons it should be discussed here:
* Whether it is used or not is ultimately a distribution policy decision.
* Tools besides rpm could benefit from it (yum could download a delta if the base package was available. Would we be able to list this in the common metadata format? etc.)
-Toshio
On Jan 28, 2004, Toshio toshio@tiki-lounge.com wrote:
He better defines my vague misgivings about the SuSE patch method and presents rsync as a transport method as an alternative (which people here have shot down as being too server intensive).
What's so server intensive about downloading pre-computed hashes from the server, then downloading the missing chunks of the file with http requests? Doesn't look like a big deal to me.
The tricky thing is that, for improved savings, you don't want to rsync the compressed bits, since those don't rsync all that well; you want to rsync the uncompressed bits, and then re-compress on the client. This would indeed make it more CPU-intensive on the server, and could pose some hurdles to make sure md5 hashes and signatures still match after the transfer, but it should be possible to work them out.
I think that the xdelta method bridges these two approaches.
xdelta requires the server to guess right which version the client has on its end, and last I looked it was non-reversible. A simplistic rsync-based solution OTOH just requires hashes for every package to be pre-computed on servers and made available for download.
What I envision is an RPM downloading mechanism that takes advantage of the pool of bits from an existing installation of a package (possibly also seeded with bits from other packages in a tree).
The end result is that, if you don't have anything in your local tree to begin with, you can download whatever version of the rpm you're interested in and it just works; if you already have another version of the package, it will be used to seed the rsync hash pool and hopefully speed the transfer up, and all of this has minimal impact on the server (other than disk space for the hashes), as long as an rsyncable version of gzip is used. If regular gzip is used, then the server would presumably have to cooperate with the client in uncompressing packages to improve download savings.
Hello Alexandre,
xdelta requires the server to guess right which version the client has on its end, and last I looked it was non-reversible.
I don't believe versioning is that much of an issue with the approach that Michael presents. Just always delta against the original rpm that came with the release, since everybody should have that one lying around (except for the 1 in 1000 people that do an FTP install from a remote server). Delete the last delta when a new update comes out, just like there usually only is one update rpm available.
A simplistic rsync-based solution OTOH just requires hashes for every package to be pre-computed on servers and made available for download.
Rsync puts a load on the server, which is probably not what you want. These deltas (have you tried these scriptlets yet?) can just be downloaded. Patching is done at the client end, and the resulting rpm is identical to the full version. The fact that rpm now uses the "minigzip algorithm" blows all objections against using xdelta out of the water. It's amazing.
These delta bundles have to be created only once and can be distributed beside the normal rpms. The user can choose whether to download the rpm and proceed as usual or get and apply the patch to the existing rpm (which would save him about 67% of the download according to a preliminary investigation by Toshio) if he is bandwidth challenged. You don't even have to integrate this into up2date or yum yet, just make an extra branch for deltas available in the FTP tree, so people who need them can download them.
The end result is that, if you don't have anything in your local tree to begin with, you can download whatever version of the rpm you're interested in and it just works;
As I mentioned above, the "local tree" should contain the initial rpm for 99% of the users. No need to put any load on the servers.
Bye, Leonard.
On Wed, 2004-01-28 at 13:22, Leonard den Ottolander wrote:
Rsync puts a load on the server, which is probably not what you want. These deltas (have you tried these scriptlets yet?) can just be downloaded. Patching is done at the client end, and the resulting rpm is identical to the full version. The fact that rpm now uses the "minigzip algorithm" blows all objections against using xdelta out of the water. It's amazing.
rsync currently puts a load on the server because it reads the file and builds a set of hash data for each request.
It would appear quite possible to produce either:
* a prebuilt table of file hashes using some external tool
* a cache of hashes held by the rsync server
Much of this is similar to the problem of html index pages for directories on a public mirror system.
Nigel.
Hello Nigel,
rsync currently puts a load on the server because it reads the file and builds a set of hash data for each request.
It would appear quite possible to produce either:
* a prebuilt table of file hashes using some external tool
* a cache of hashes held by the rsync server
Please have a look at the scripts that came with the post that started this thread. Do you have any reason to assume that these "rsync hashes" can be more efficient than the deltas generated by Mike's script?
By the way, minigzip.c comes with the zlib-devel rpm (/usr/share/doc/zlib-devel-<version>).
$ gcc minigzip.c -lz -o minigzip
Bye, Leonard.
On Wed, 2004-01-28 at 13:55, Leonard den Ottolander wrote:
rsync currently puts a load on the server because it reads the file and builds a set of hash data for each request.
It would appear quite possible to produce either:
* a prebuilt table of file hashes using some external tool
* a cache of hashes held by the rsync server
Please have a look at the scripts that came with the post that started this thread. Do you have any reason to assume that these "rsync hashes" can be more efficient than the deltas generated by Mike's script?
This is getting way off topic, *but* it would be very useful to have an rsync server which used/maintained a saved file hash set because it would be of general use on the mirrors, saving them significant CPU/IO usage (or saving them turning off the rsync crc code due to lack of CPU/IO).
Whether this would be particularly useful for the specific case of downloading update rpms when you have the original rpm is very debatable - probably not since the major rpm payload is compressed and probably not amenable to rsync.
However, even if this rpm delta stuff works well it requires the mirrors to carry another pile of files of the order of 50% of the size of current updates *as* *well* *as* the current updates themselves. As a mirror operator I have to say that the case for giving a distribution yet another huge chunk of disk for what appears to be a corner case seems pretty weak.
But this is still rattling on on a distribution list. A basic proof of concept has been posted (serious kudos to the guy who did it - it took me a good few minutes of puzzling to work out what the hell he had done), but this needs to be taken to the rpm list or some new rpm delta development forum and made into something that really could be used, including a file format (shipping pairs of delta files will not work), a naming, metadata and maybe signing system, and then think about integrating it. Present stuff is about as appropriate as discussing how we are going to handle the 2.8 (yes, that's the right number) kernel in Fedora - an exercise in vapourware planning.
Nigel.
On Jan 28, 2004, Leonard den Ottolander leonard@den.ottolander.nl wrote:
Just always delta against the original rpm that came with the release, since everybody should have that one lying around
Lying around in the install CD? Or in .iso format on some NFS server used for the install? I guess these are the most common cases, and I somehow can't see them as convenient.
The most common case is that the person will have the previous release of an update installed on their system. If you only generate deltas from the base release, you don't help such people.
If you generate deltas from base to first update, from first update to second, etc, it works but you may not save as much on downloads. If you generate deltas from each earlier release to the latest, you save on downloads, but potentially waste disk space.
Using an rsync-based download, OTOH, you don't need to generate deltas at all: you pre-generate the hashes on the server, the client downloads that, figures out which chunks it needs based on what it has on its end of the network, and requests only those from the server.
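The pre-generated hash idea can be sketched in Python as below. This is a deliberately simplified illustration and not the rsync algorithm itself: real rsync uses a rolling weak checksum so that it can find matching blocks at any offset in the old file, whereas this sketch only matches blocks at aligned offsets. The block size, the choice of SHA-256 and the function names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, picked arbitrarily for the sketch


def block_hashes(data, block_size=BLOCK_SIZE):
    """Server side: hash each fixed-size block of the file once, ahead of time."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def blocks_to_request(old_data, new_hashes, block_size=BLOCK_SIZE):
    """Client side: list the indexes of blocks of the new file that the
    client cannot find among the (aligned) blocks of its old file."""
    have = set(block_hashes(old_data, block_size))
    return [index for index, digest in enumerate(new_hashes) if digest not in have]


# Stand-in files: the middle block changed between the two versions.
old = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
new = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE

published = block_hashes(new)            # what the server would publish
print(blocks_to_request(old, published)) # [1]
```

The client would then fetch only the listed blocks, for example with HTTP range requests, and copy the rest from its local file. Nothing here depends on which old version the client holds.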
A simplistic rsync-based solution OTOH just requires hashes for every package to be pre-computed on servers and made available for download.
Rsync puts a load on the server, which is probably not what you want.
See the `pre-computed hashes' bit.
These deltas (have you tried these scriptlets yet?) can just be downloaded.
The problem with the deltas is that they're static, making a guess on what the client has on its end. rsync hashes, OTOH, make no such assumptions: they tell, with a very small footprint, what the client is about to download, such that the client can tell which bits it doesn't need. The fact that rpm uses minigzip makes it as suitable for rsyncing as for xdelta.
These delta bundles have to be created only once and can be distributed beside the normal rpms.
Just like the rsync hashes. But the rsync hashes are based only on the one (updated) binary rpm you're willing to distribute. No assumption on what the base version was. So it can be used for upgrades as much as it can be used for updates, without the overhead of generating xdeltas from every other earlier distro release, regardless of released updates.
As I mentioned above, the "local tree" should contain the initial rpm for 99% of the users. No need to put any load on the servers.
Not necessarily true for the half dozen kernel updates we've had in FC1 since its release. kernel-sources is pretty big, and up2date -u replaces the installed files and then throws the .rpm file away.
The rsync protocol-based solution I'm proposing would reconstruct the rpm file from the installed files, and use that to save on the download.
On Wed, 2004-01-28 at 16:39, Alexandre Oliva wrote:
The problem with the deltas is that they're static, making a guess on what the client has on its end. rsync hashes, OTOH, make no such assumptions: they tell, with a very small footprint, what the client is about to download, such that the client can tell which bits it doesn't need. The fact that rpm uses minigzip makes it as suitable for rsyncing as for xdelta.
While I really prefer this solution, the assertion that minigzip makes the packages "rsync-able" does not hold up. The rpm-delta proof-of-concept code used rpm2cpio to extract the rpm payload, and this implicitly performs an un-minigzip operation on the payload. The uncompressed cpio payloads are then given to xdelta.
I have just tried using a couple of large rpms (on the assumption that these should have more commonality between minor version tweaks than small rpms) rsync-ing an updated rpm onto the previous rpm.
For a Fedora kernel rpm (kernel-2.4.22-1.2140.nptl.i686.rpm and kernel-2.4.22-1.2149.nptl.i686.rpm) there was effectively no speedup seen by pre-heating the destination file with its previous version. The extra negotiation required was of a very similar size to the actual commonality between the packages.
With the glibc-common package (picked as it was quite likely to have no changes, and particularly would not suffer from object file timestamps), there was a slight advantage - 1717760 bytes out of 11193793 were saved, but at a cost of 60674 in protocol overhead sending the hashes.
I would guess that the commonality may well be the rpm internal metadata - the payload compression would cause the data stream to become very different quite quickly.
The numbers don't justify this with rsync at present unless rpm were changed to make the compression be done on a per payload file, or possibly per block basis. That might well make the basic rpms larger and would certainly require a rpm filespec and tools update, with backwards compatibility issues.
The rsync protocol-based solution I'm proposing would reconstruct the rpm file from the installed files, and use that to save on the download.
This doesn't seem to fly based on my very quick suck-it-and-see tests. Anyone got numbers to contradict me?
Nigel.
Hello Nigel,
While I really prefer this solution, the assertion that minigzip makes the packages "rsync-able" does not hold up. The rpm-delta proof-of-concept code used rpm2cpio to extract the rpm payload, and this implicitly performs an un-minigzip operation on the payload. The uncompressed cpio payloads are then given to xdelta.
I guess that what makes minigzip work here is that it'll produce the same output for the same input (correct me if I'm wrong). It might have nothing to do with the rsyncability patch to gzip, as I suggested in an earlier mail.
But these steps could be implemented inside rsync, and the precomputed hashes could be computed on the unminigzipped cpios and header parts.
Bye, Leonard.
On Wed, 28 Jan 2004, Nigel Metheringham wrote:
I have just tried using a couple of large rpms (on the assumption that these should have more commonality between minor version tweaks than small rpms) rsync-ing an updated rpm onto the previous rpm.
For a Fedora kernel rpm (kernel-2.4.22-1.2140.nptl.i686.rpm and kernel-2.4.22-1.2149.nptl.i686.rpm) there was effectively no speedup seen by pre-heating the destination file with its previous version. The extra negotiation required was of a very similar size to the actual commonality between the packages.
With the glibc-common package (picked as it was quite likely to have no changes, and particularly would not suffer from object file timestamps), there was a slight advantage - 1717760 bytes out of 11193793 were saved, but at a cost of 60674 in protocol overhead sending the hashes.
I think your figures are a bit misleading, because I suspect you were rsyncing the plain rpms, which will always be close to the rpm size, as you are basically copying the compressed cpio archive, and only doing something clever with the header. Here are my figures (xdelta is the sum of the two delta files, rsync is the sum of the download and checksums):

glibc-common 2.3.2-101->101.4
  xdelta                  113194+19199   =   132393
  plain rpm rsync       11624765+21598   = 11646363
  extracted cpio rsync   4775981+46772   =  4822753
  straight download                      = 12903889

kernel-2.4.22-1.2115.nptl.i686->2149
  xdelta                 5591072+30088   =  5621160
  plain rpm rsync       12774882+21586   = 12796468
  extracted cpio rsync  25674282+32848   = 25707130
  straight download                      = 12798261
The cpio rsync is of course only part of the traffic you would need to download as it ignores the header parts of the rpm.
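To make the comparison easier to read, the figures above can be totalled and expressed as a share of the straight download. The numbers are copied from the message; the dictionary layout and function names are only an illustrative choice.

```python
# Sizes copied from the figures quoted above: (payload part, overhead part).
figures = {
    "glibc-common 2.3.2-101 -> 101.4": {
        "xdelta": (113194, 19199),
        "plain rpm rsync": (11624765, 21598),
        "extracted cpio rsync": (4775981, 46772),
        "straight download": (12903889,),
    },
    "kernel-2.4.22-1.2115.nptl.i686 -> 2149": {
        "xdelta": (5591072, 30088),
        "plain rpm rsync": (12774882, 21586),
        "extracted cpio rsync": (25674282, 32848),
        "straight download": (12798261,),
    },
}

def totals(methods):
    """Sum the parts of each transfer method."""
    return {name: sum(parts) for name, parts in methods.items()}

def percent_of_download(methods):
    """Express each method's total as a percentage of the straight download."""
    sums = totals(methods)
    full = sums["straight download"]
    return {name: round(100.0 * size / full, 1) for name, size in sums.items()}

for package, methods in figures.items():
    print(package)
    sums = totals(methods)
    for name, pct in percent_of_download(methods).items():
        print(f"  {name:22s} {sums[name]:>10d}  {pct:6.1f}%")
```

With these inputs the xdelta transfer comes to roughly 1% of the straight download for glibc-common and roughly 44% for the kernel, while rsync on the plain rpms saves almost nothing.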
Michael Young
Hello Alexandre,
The fact that rpm uses minigzip makes it as suitable for rsyncing as for xdelta.
That is correct, I had realized that, and I have no objection against somebody implementing this in rsync (I have considered this possibility as a result of you letting me know about rsyncing ISOs). But I do not share your objections that people would not have the original rpm lying around (even if it were located on an NFS server, then that server would most probably also be used to retrieve updates from, in which case the patching would be done on that machine).
But the rsync hashes are based only on the one (updated) binary rpm you're willing to distribute. No assumption on what the base version was.
You might be correct that this would be a superior approach, but this deltaing works already (it only needs some code to handle the versioning, but it can already be used manually in its crude current form).
Plus I am afraid that this rsync hashing (or deltaing) might not work efficiently on highly different versions of a piece of software, i.e. release updates. That should be tested. (Otoh, I believe it could be very efficient on srpms (if the tarballs were minigzipped instead of gzipped, that is).)
As I mentioned above, the "local tree" should contain the initial rpm for 99% of the users. No need to put any load on the servers.
Not necessarily true for the half dozen kernel updates we've had in FC1 since its release.
It would be safe to assume that if people have the need to use these deltas due to bandwidth limitations they are willing to keep the original rpm lying around. And don't most people burn the original ISOs to cd?
kernel-sources is pretty big, and up2date -u replaces the installed files and then throws the .rpm file away.
Personally I only use up2date to check if updates are available, but I assume that up2date can be taught to keep the rpm around (although that is not really relevant if you assume the original/initial rpm to be used as reference).
The rsync protocol-based solution I'm proposing would reconstruct the rpm file from the installed files, and use that to save on the download.
Great! Yet another method for people to use instead of downloading the whole rpm. Since the compression vs xdeltaing issue was obviously solved without anybody noticing (apart from Michael) I don't see theoretical problems in implementing this in rsync.
Bye, Leonard.
On Wed, 28 Jan 2004, Alexandre Oliva wrote:
On Jan 28, 2004, Leonard den Ottolander leonard@den.ottolander.nl wrote:
Just always delta against the original rpm that came with the release, since everybody should have that one lying around
Lying around in the install CD? Or in .iso format on some NFS server used for the install? I guess these are the most common cases, and I somehow can't see them as convenient.
The people who rsync/xdelta would help most are those who have bandwidth problems, and they will probably have installed from CD. It might be possible to recreate an rpm from installed files (possibly using some extra downloads to fill missing info/config files).
The most common case is that the person will have the previous release of an update installed on their system. If you only generate deltas from the base release, you don't help such people.
True. Probably the best way to do this would be incremental deltas, though I agree it isn't ideal.
Using an rsync-based download, OTOH, you don't need to generate deltas at all: you pre-generate the hashes on the server, the client downloads that, figures out which chunks it needs based on what it has on its end of the network, and requests only those from the server.
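The scheme described here can be sketched as follows. This is a simplified illustration, not the rsync protocol itself: the tiny block size, the use of MD5 and the function names are all assumptions, and hashing every offset of the local file stands in for the cheap rolling checksum that rsync pairs with a stronger hash.

```python
import hashlib

BLOCK = 8  # tiny block size, for demonstration only

def block_hashes(data, block=BLOCK):
    """Server side: one digest per fixed-size block of the new file."""
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def plan_download(old, hashes, block=BLOCK):
    """Client side: map each server block to a local offset, or None if it must be fetched."""
    local = {}
    for off in range(0, max(len(old) - block, 0) + 1):
        digest = hashlib.md5(old[off:off + block]).hexdigest()
        local.setdefault(digest, off)
    return [local.get(h) for h in hashes]

def rebuild(old, plan, fetch, block=BLOCK):
    """Assemble the new file from local blocks plus the blocks fetched from the server."""
    out = bytearray()
    for index, off in enumerate(plan):
        out += old[off:off + block] if off is not None else fetch(index)
    return bytes(out)

# Made-up data: the new file drops a prefix and gains one block in the middle.
old = b"header-" + b"A" * 8 + b"B" * 8 + b"C" * 8
new = b"A" * 8 + b"B" * 8 + b"X" * 8 + b"C" * 8

plan = plan_download(old, block_hashes(new))
result = rebuild(old, plan, lambda i: new[i * BLOCK:(i + 1) * BLOCK])
print("blocks to fetch:", plan.count(None), "of", len(plan))
print("rebuilt correctly:", result == new)
```

The hash list is the protocol overhead that shows up in the figures earlier in the thread: it has to be downloaded in full before the client knows which blocks it can skip.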
The catch here is that you not only have to store the hashes, but you also have to store the uncompressed rpm file, or uncompress it on the fly for each update, which hits either the cpu or the storage space on the server. You also either have to have a special rsync like server, or you have several http connections to retrieve the file bits. A static xdelta results in a single file fetch and no extra server load.
rsync does give you the chance to start from a close but not exact starting point; however, it does this at the expense of bandwidth because of the hash transmission, and because it uses larger blocks. Hence you use more bandwidth for an rsync download than with a static xdelta, though I haven't tested this to see by how much.
Michael Young
Hi Nils,
While this is a development issue, an RPM diffing tool is in no way Fedora specific (I hope).
I think an RPM diffing tool would be most on-topic on rpm-list (out of the pool of existing mailing lists),
I have no problem continuing this discussion on the rpm list (although the question where this is most on topic remains), but I hope you don't expect me to go sit in an empty room and wait for people to start talking to me ;-) .
Bye, Leonard.
On Tuesday 27 January 2004 02:24 am, Warren Togami wrote:
For those folks that want to eventually have optional RPM diffs for upgrades, please do not continue the discussion here. Instead if you feel so strongly that it is technically good and without drawbacks, then implement everything needed and provide a test repository and tools at your project site. Only after you feel your project page, tools, and repository are PERFECT, then announce here for testing and comments.
??? While I understand your concerns, Warren, this sounds arrogant. You may not have meant it that way, but it is the way it sounds to me, at least. This is not an announce list, nor is it the beta list, nor is it the user list. It is the devel list, and this is a development issue. If it's a bandwidth concern of wasting your bandwidth reading this, then you are the perfect candidate for a bandwidth saving RPM patch method.
I would suggest that you research the problem that was discussed by Debian and Red Hat for at least the past 5 years. Many smart people have looked at this problem. Try not to repeat the same mistakes and learn from their prior discussions.
It is useless to say 'learn from the past problem and discussion' without providing at least a small pointer to said discussion. If signing is the problem, it appears that that is being worked on. And things change enough in five years to warrant a fresh look. Just because some smart people have looked at the problem does not mean that the person with the right perspective and the right insight hasn't yet looked at the problem. If where to place the baseline is the problem, that is a discussion that can only take place here, since the baseline for the update would revolve around the releases. There are, of course, other ways to do it: that's why I've been keeping quiet during this latest round of discussions. I'm finding some fascinating points are being talked about that I hadn't thought of.
Do not discuss it here because generally the elders are totally not convinced that it is a good thing to do.
<sarcasm>And one wouldn't want to offend the sacred elders. </sarcasm>
However I feel that we have more important things to work on that are higher priority, like Fedora Project's infrastructure, so I personally won't put any effort into this for at least a year or more.
An RPM diff mechanism qualifies as a part of the infrastructure. Decisions about the mechanism for distributing these diffs are an infrastructure concern. Just because some developers feel threatened by this discussion does not mean all do. And I personally believe and think that this discussion belongs here, since this is the group that needs convincing of the utility of this approach. There is a prototype script; let's bang on it. The whole mechanism shouldn't have to appear fully formed and tested; although I suspect that you would have some objection even then. Perfection is never fully achieved; one man's perfection is another man's bug-ridden-mess. But you are certainly entitled to an opinion otherwise. Of course, the fact that SuSE already has a full infrastructure using a similar system that is working, is working well, and is quite well tested is not even considered, because it was Not Invented Here.
But if you don't want to participate in this facet of development, then don't. Others are. Others who might not have participated in the facet of infrastructure that you are rightly concerned about, maybe are involved in this. But I do agree that rpm-list might be the best venue for this discussion. But it is an appropriate discussion here, too.
My interest in this is due to package sizes for frequently updated components (my component, PostgreSQL, fortunately isn't very frequently updated) such as the kernel and glibc. KDE and X updates should benefit as well.
And this part of the infrastructure has the potential to benefit users of the dist in a more tangible way than some other parts of the infrastructure. IMO, YMMV, etc. Put simply: 'You can get your updates faster.'
Michael, you're a genius!
Here is a couple of demo short scripts for generating and applying deltas between two rpms.
This is so unbelievably simple that for a while I thought you were pulling your own leg.
All one needs to do is bundle the two deltas with a text file (or xml if you want to be fancy) describing the versioning and let the script handle the naming.
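A possible shape for that text file, as a sketch only: the field names, the naming scheme and the example file names below are all hypothetical and do not come from Michael's scripts.

```python
def delta_basename(name, old_version, new_version, arch):
    """Hypothetical naming scheme: both versions are visible in the file name."""
    return f"{name}-{old_version}_{new_version}.{arch}"

def manifest_text(name, old_version, new_version, arch, delta_files):
    """Render a plain-text description of which rpm the deltas apply to."""
    lines = [
        f"name: {name}",
        f"arch: {arch}",
        f"from: {old_version}",
        f"to: {new_version}",
    ]
    lines += [f"delta: {filename}" for filename in delta_files]
    return "\n".join(lines) + "\n"

def parse_manifest(text):
    """Read the description back; repeated 'delta' keys are collected in a list."""
    info = {"delta": []}
    for line in text.splitlines():
        key, _, value = line.partition(": ")
        if key == "delta":
            info["delta"].append(value)
        elif key:
            info[key] = value
    return info

base = delta_basename("glibc-common", "2.3.2-101", "2.3.2-101.4", "i386")
text = manifest_text("glibc-common", "2.3.2-101", "2.3.2-101.4", "i386",
                     [base + ".part1.xdelta", base + ".part2.xdelta"])
print(text, end="")
print(parse_manifest(text)["to"])
```

An apply script could then refuse to patch unless the installed or cached rpm matches the "from" version, which is the versioning check the prototype still lacks.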
All that is needed to use this is an extra branch for deltas, next to the updates branch. And I thought xdeltas would be too expensive. I guess minigzip changes all that. It's unbelievable.
Bye, Leonard.