On Wed, 2008-01-23 at 21:41 -0500, Warren Togami wrote:
I just had an in-depth discussion with Henrik Nordström of the Squid
project about how HTTP mirrors and the yum tool itself could be improved
to safely handle proxy caches. He gave me lots of good advice about how
HTTP mirrors can be configured for cache safety, Squid can be configured
for yum metadata cache safety, and yum itself can be improved to be more
robust in dealing with proxy caches.
(It turns out that Henrik is an avid Fedora user, and I might have
convinced him to come onboard the Fedora Project to contribute another
useful tool and become co-maintainer of his own package. It would be an
honor to have him onboard as a Fedora Developer. =)
You might have had a small discussion on #yum then, as any of the
regulars there know the answers to all of your questions.
Yum and Proxy Caches: Current Dangers
=====================================
Users may be using proxy servers in 3 (or more) ways:
1) Many users today are behind a transparent proxy cache
instituted by their ISP, school, or business network.
2) Other users might have Internet access *only* through a proxy server.
3) Other users might be using a reverse proxy server on their local
network as a caching yum mirror.
There are two cases where yum has problems with proxy caches:
1) An RPM package changes content without changing filename. This
usually happens only in instances where a package was pushed unsigned
then was later signed. A simple workaround within yum is discussed
later in this mail.
2) yum currently has problems with proxy caches due to common cases
where metadata can become partially out of sync. This happens because
repomd.xml is grabbed often while other repodata files are grabbed less
often. repomd.xml is then checked for origin "freshness" more often.
When repodata changes on the origin, repomd.xml is refreshed on the
cache before other repodata files. yum clients seeing the new
repomd.xml but old primary.sqlite.bz2 error out.
#2 is worked around as well as is possible in the upcoming 3.2.9, in
that yum will basically create a transaction over the repomd.xml and the
metadata itself. If you use mdpolicy=group:all ... this will always
work. The downside is that you'll need to download all of the metadata,
so that is not the default.
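
To make the "transaction" idea concrete, here is a rough sketch, not the actual 3.2.9 code. The names (commit_if_consistent, the expected and downloaded dicts, the cache dict standing in for /var/cache/yum) are all made up for illustration. The point is only that nothing in the cache changes unless every downloaded file matches the checksum listed in the new repomd.xml.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the hex SHA-1 digest of a byte string."""
    return hashlib.sha1(data).hexdigest()


def commit_if_consistent(expected, downloaded, cache):
    """Update the cache only if the whole metadata set is consistent.

    expected   -- hypothetical {filename: sha1 hex} taken from the new repomd.xml
    downloaded -- {filename: bytes} fetched (possibly via a stale proxy cache)
    cache      -- dict standing in for the on-disk metadata cache
    """
    # Verify everything first; a single mismatch aborts the transaction.
    for name, digest in expected.items():
        data = downloaded.get(name)
        if data is None or sha1_hex(data) != digest:
            return False  # old metadata set stays untouched

    # All files match the new repomd.xml, so the switch is safe.
    for name in expected:
        cache[name] = downloaded[name]
    return True
```

With a stale primary.sqlite.bz2 coming back from a proxy, the function returns False and the client keeps using its old, self-consistent metadata instead of erroring out halfway.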
Ideal Solution for #2 Partial Repodata Sync Problem
===================================================
Henrik highly suggests using versioned repodata files as the ideal
solution to this problem. This way caches can serve repodata without
fear of the sync problem, and also without querying the origin server
upon every client download. repomd.xml would reference filenames that
change with each update, perhaps by embedding a timestamp.
e.g.
primary-1201140584.sqlite.bz2
This would be an elegant solution, but will it be possible for us to
migrate to it, given that older clients wouldn't be able to handle it?
I'm guessing not, so here are other less efficient but workable solutions.
We've discussed this and think this is probably the best solution, but:
1. Don't use timestamps, use the sha1 of the file, because then
multiple createrepo runs will always create the same filenames.
2. This requires work inside yum, as at the moment yum doesn't do any
cleanup of its metadata downloads, so /var/cache/yum would grow without
bound (although "yum clean ..." will work).
...we can fix #2 for 3.2.9, so we could do this in Fedora 9 onwards.
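
A minimal sketch of both points, assuming the digest is placed where the timestamp sits in the example filename above. The actual naming scheme is not decided in this thread, and versioned_name and stale_files are hypothetical helpers, not createrepo or yum functions.

```python
import hashlib


def versioned_name(filename: str, content: bytes) -> str:
    """Insert the SHA-1 of the content after the first name component.

    "primary.sqlite.bz2" becomes "primary-<sha1>.sqlite.bz2" (assumed layout).
    """
    base, dot, rest = filename.partition(".")
    digest = hashlib.sha1(content).hexdigest()
    return f"{base}-{digest}{dot}{rest}"


def stale_files(cached_names, referenced_names):
    """Return cached metadata files that the current repomd.xml no longer references."""
    return sorted(set(cached_names) - set(referenced_names))
```

Because the name depends only on the content, re-running createrepo over unchanged packages yields the same filename, so caches keep serving it. The second helper is the cleanup yum would need: anything in the cache directory that the current repomd.xml does not list can be deleted.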
"Cache-Control: max-age=0"
==========================
This HTTP header directive can be either in the request or response.
This instructs the proxy cache server to always query the origin HTTP
server to check if the requested file has changed. It compares the
origin's reported Last-Modified or ETag to what Squid knows in its own
cache.
This means that each and every request for repodata/* files will trigger
a query to the origin server. This is a relatively quick operation and
an acceptable compromise if we cannot make repodata filenames versioned.
This is a horrible hack, IMO, and I can pretty much guarantee that not
all of the mirrors will do this. If it was possible for us to control
all of the mirrors then we could just require them all to set up ETags
and use that ... but again, I think that's hoping for way too much.
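
For anyone unfamiliar with what max-age=0 makes the cache do, here is a toy model of the revalidation step. It is not Squid's implementation; ToyCache and its fields are invented for illustration, and the origin is just a dict of path to (etag, body).

```python
class ToyCache:
    """Toy proxy cache that revalidates every request, as max-age=0 demands."""

    def __init__(self, origin):
        self.origin = origin      # dict: path -> (etag, body), stands in for the mirror
        self.store = {}           # dict: path -> (etag, body), the cache's own copy
        self.origin_queries = 0   # how often the origin had to be contacted

    def get(self, path):
        # With max-age=0 the cache may not answer without asking the origin.
        self.origin_queries += 1
        origin_etag, origin_body = self.origin[path]

        cached = self.store.get(path)
        if cached is not None and cached[0] == origin_etag:
            # Validator matches: a real origin would reply 304 Not Modified,
            # so only headers cross the wire and the cached body is reused.
            return cached[1], "REVALIDATED"

        # Validator differs (or nothing cached): fetch and store the new body.
        self.store[path] = (origin_etag, origin_body)
        return origin_body, "REFETCHED"
```

Every request costs one round trip to the origin, which is the compromise being discussed, but the body is only transferred again when the validator has changed. It also shows why this only helps on mirrors that send usable validators in the first place.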
[...]
Yum and "X-Cache: HIT"
======================
If you use wget --server-response and a target file, you see the raw
HTTP headers of that request. If the file is already cached, you see an
HTTP header like the one below:
X-Cache: HIT from proxyserver.example.com
Proposal:
Improve yum with the following download logic:
IF (a downloaded repodata/* file doesn't match the repomd.xml checksum
OR a downloaded RPM doesn't match the expected checksum)
AND "X-Cache: HIT from" was in its HTTP header
THEN download it again with URLGrabber option:
http_headers = (('Pragma', 'no-cache'),)
This should solve the case where RPM files legitimately change contents
without changing filenames, like RPM signing. This also correctly does
NOT trigger additional downloads upon other errors like corrupted files.
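
The decision logic in the proposal can be sketched in a few lines. This is only an illustration: needs_uncached_retry is a hypothetical helper, headers is assumed to be a plain dict of response headers, and where such a check would actually live is a separate question.

```python
# Header tuple in the form URLGrabber's http_headers option expects.
NO_CACHE_HEADERS = (("Pragma", "no-cache"),)


def needs_uncached_retry(checksum_ok: bool, headers: dict) -> bool:
    """Return True if a failed download should be retried past the proxy cache.

    checksum_ok -- whether the file matched the expected checksum
    headers     -- response headers as {name: value} (assumed representation)
    """
    if checksum_ok:
        return False  # nothing wrong, no retry needed

    # Header names are case-insensitive, so look for X-Cache in any casing.
    x_cache = ""
    for name, value in headers.items():
        if name.lower() == "x-cache":
            x_cache = value

    # Only a cache hit suggests the proxy served stale content; a bad file
    # straight from the mirror would not be fixed by bypassing the cache.
    return x_cache.strip().upper().startswith("HIT FROM")
```

A caller would retry the same URL with NO_CACHE_HEADERS only when this returns True, which keeps corrupted files on the mirror itself from triggering pointless extra downloads.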
You'd have to do this change inside URLgrabber itself, as by the time
yum could react to it, URLgrabber would already have decided to remove
that mirror from its list and moved on.
--
James Antill <james.antill(a)redhat.com>
Red Hat