Yum, Proxy Cache Safety, Storage Backend
wtogami at redhat.com
Thu Jan 24 02:41:40 UTC 2008
I just had an in-depth discussion with Henrik Nordström of the Squid
project about how HTTP mirrors and the yum tool itself could be improved
to safely handle proxy caches. He gave me lots of good advice about how
HTTP mirrors can be configured for cache safety, Squid can be configured
for yum metadata cache safety, and yum itself can be improved to be more
robust in dealing with proxy caches.
(It turns out that Henrik is an avid Fedora user, and I might have
convinced him to come onboard the Fedora Project to contribute another
useful tool and become co-maintainer of his own package. It would be an
honor to have him onboard as a Fedora Developer. =)
Yum and Proxy Caches: Current Dangers
Users may be using proxy servers in 3 (or more) ways:
1) Many users today are behind a transparent proxy cache, either
instituted by their ISP, school, or business network.
2) Other users might have Internet access *only* through a proxy server.
3) Other users might be using a reverse proxy server on their local
network as a caching yum mirror.
There are two cases where yum has problems with proxy caches:
1) An RPM package changes content without changing its filename. This
usually happens only when a package was pushed unsigned and later
signed. A simple workaround within yum is discussed later in this mail.
2) yum currently has problems with proxy caches in the common case
where metadata becomes partially out of sync. This happens because
repomd.xml is requested far more often than the other repodata files,
so the cache revalidates its "freshness" against the origin more often.
When the repodata changes on the origin, repomd.xml is refreshed in the
cache before the other repodata files are. yum clients that see the new
repomd.xml but the old primary.sqlite.bz2 error out.
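The race can be sketched in a few lines of Python; the file contents
and cache state here are invented purely to illustrate the mismatch:

```python
import hashlib

def checksum(data):
    """Hex SHA-1 of a file's bytes, as repomd.xml would record it."""
    return hashlib.sha1(data).hexdigest()

# Origin mirror after a repodata push: repomd.xml records the checksum
# of the *new* primary.sqlite.bz2.
new_primary = b"new primary metadata"
origin = {
    "repodata/repomd.xml": checksum(new_primary),
    "repodata/primary.sqlite.bz2": new_primary,
}

# Proxy cache state: repomd.xml was revalidated recently (fresh copy),
# but primary.sqlite.bz2 is still the old cached object.
cache = {
    "repodata/repomd.xml": origin["repodata/repomd.xml"],   # fresh
    "repodata/primary.sqlite.bz2": b"old primary metadata", # stale
}

expected = cache["repodata/repomd.xml"]
actual = checksum(cache["repodata/primary.sqlite.bz2"])
print(expected == actual)  # False: yum errors out on the mismatch
```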
Ideal Solution for #2 Partial Repodata Sync Problem
Henrik strongly suggests versioned repodata files as the ideal solution
to this problem. That way caches can serve repodata without fear of the
sync problem, and without querying the origin server upon every client
download. repomd.xml would then reference filenames that change with
every push, perhaps with a timestamp or checksum embedded in the name.
This would be an elegant solution, but would it be possible for us to
migrate to it, given that older clients wouldn't be able to handle it?
I'm guessing not, so here are other less efficient but workable solutions.
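For illustration only, versioned repodata might look something like
this inside repomd.xml; the timestamp-prefixed filename and structure
here are invented, not an existing format:

```xml
<data type="primary_db">
  <location href="repodata/1201234567-primary.sqlite.bz2"/>
</data>
```

Each push would emit a new filename, so a cache can never pair a new
repomd.xml with stale metadata: the old and new files coexist under
different names.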
Cache-Control: max-age=0

This HTTP header directive can appear in either the request or the
response. It instructs the proxy cache server to always query the
origin HTTP server to check whether the requested file has changed,
comparing the origin's reported Last-Modified or ETag against what
Squid has in its own cache.
This means that each and every request for repodata/* files will trigger
a query to the origin server. This is a relatively quick operation and
an acceptable compromise if we cannot make repodata filenames versioned.
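The revalidation itself is just a conditional GET; the hostname, path,
and date below are invented for illustration:

```
GET /fedora/updates/8/i386/repodata/repomd.xml HTTP/1.1
Host: mirror.example.com
If-Modified-Since: Wed, 23 Jan 2008 20:00:00 GMT

HTTP/1.1 304 Not Modified
```

When nothing changed, the origin answers 304 with no body, so the
per-request cost is one small round trip, not a re-download.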
This HTTP directive can be applied to repodata/* files at three levels:
1) Origin HTTP mirrors can be configured to serve "Cache-Control:
max-age=0" in HTTP headers whenever they serve repodata/* files. This
can become a standard recommendation for all Fedora mirrors. Does
anyone know how to configure Apache to do this?
2) Squid refresh_pattern can use a regex to override max-age=0 for
repodata/* files. I haven't figured out exactly what the syntax is for
this. Anybody know squid.conf?
3) yum can always include the HTTP directive in its request for
repodata/* files. Can we make this the default in future versions of
yum? I personally don't see a drawback (unless repodata becomes
versioned, in which case we don't want this).
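Untested sketches of what I believe the configuration for 1) and 2)
would look like; please verify the syntax and regexes before deploying.
For Apache on the origin mirror, mod_headers should do it:

```apache
# httpd.conf: force proxy revalidation of all repodata/* files
# (requires mod_headers to be loaded)
<LocationMatch "/repodata/">
    Header set Cache-Control "max-age=0"
</LocationMatch>
```

For Squid, a refresh_pattern with min, percent, and max all zero marks
matching objects as always stale, forcing revalidation; it must appear
before the catch-all pattern:

```
# squid.conf: never serve repodata/* without checking the origin
refresh_pattern -i /repodata/ 0 0% 0
```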
Yum and "X-Cache: HIT"
If you run wget --server-response against a target file, you see the
raw HTTP headers of the response. If the file was served from a cache,
you see an HTTP header like the one below:
X-Cache: HIT from proxyserver.example.com
Improve yum with the following download logic:
IF (a downloaded repodata/* file doesn't match the repomd.xml checksum
OR a downloaded RPM doesn't match the expected checksum)
AND "X-Cache: HIT from" was in its HTTP header
THEN download it again with the URLGrabber option: http_headers =
(('Pragma', 'no-cache'),)
This should solve the case where RPM files legitimately change contents
without changing filenames, like RPM signing. This also correctly does
NOT trigger additional downloads upon other errors like corrupted files.
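A sketch of that decision logic in Python; the function names are my
own invention, not yum's actual code, and I'm assuming the retry would
pass a cache-bypass header such as Pragma: no-cache to URLGrabber:

```python
def served_from_cache(headers):
    """True if an X-Cache header reports a proxy cache HIT."""
    return headers.get("X-Cache", "").startswith("HIT from")

def should_refetch_bypassing_cache(checksum_ok, headers):
    """Retry with cache-bypass headers only when the file is corrupt
    AND it demonstrably came from a proxy cache.  A checksum mismatch
    on a direct download is a real error, not a stale-cache problem,
    so no extra download is triggered for it."""
    return (not checksum_ok) and served_from_cache(headers)

# Stale cached copy of a re-signed RPM: mismatch plus cache HIT.
print(should_refetch_bypassing_cache(
    False, {"X-Cache": "HIT from proxyserver.example.com"}))  # True
# Corrupt download straight from the origin: do not loop retrying.
print(should_refetch_bypassing_cache(False, {}))  # False
# Good download: nothing to do.
print(should_refetch_bypassing_cache(
    True, {"X-Cache": "HIT from proxyserver.example.com"}))  # False
```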
Squid Configuration Suggestion
This option is not on by default, but it is pretty useful for proxy
servers of our type. It makes multiple clients asking for the same
file, when it is not yet in the server's cache, wait on the same origin
download connection instead of spawning more downloads of the same thing.
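Assuming the option being described is Squid's collapsed_forwarding
directive (which matches this behavior), enabling it is a one-liner:

```
# squid.conf: merge concurrent client requests for the same uncached
# object into a single fetch from the origin server
collapsed_forwarding on
```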
Upstream Squid on Adapting Squid's Storage Engine
<hno> Squid currently has an abstract index keyed on integers which
makes this hard to implement, but we are planning to break that out
from Squid, allowing the cache to structure itself in whatever manner,
with one possible approach being to store the URLs as-is in the
filesystem (within certain bounds)
<hno> Apache does not have this same abstract internal layer, and
writing a mod_disk_cache replacement which keeps a mirror-type file
structure should be a pretty easy thing to do.
<adri> It's at least on my draft squid-2 roadmap for ~6 months from now
<adri> Since it's going to be important for people running squid in
environments where a high number of lookups for large objects isn't
required, but they want to cache gigabytes/terabytes of $LARGE objects
<adri> without having huge amounts of RAM involved
So unfortunately squid currently cannot be adapted to be the perfect
InstantMirror, but we might be able to achieve it quickly by adapting
Apache's mod_disk_cache. Anybody touched this part of Apache before?