I just had an in-depth discussion with Henrik Nordström of the Squid project about how HTTP mirrors and the yum tool itself could be improved to safely handle proxy caches. He gave me lots of good advice about how HTTP mirrors can be configured for cache safety, how Squid can be configured for yum metadata cache safety, and how yum itself can be made more robust in dealing with proxy caches.

(It turns out that Henrik is an avid Fedora user, and I might have convinced him to come onboard the Fedora Project to contribute another useful tool and become co-maintainer of his own package. It would be an honor to have him onboard as a Fedora Developer. =)

Yum and Proxy Caches: Current Dangers
=====================================

Users may be using proxy servers in 3 (or more) ways:

1) Many users today are behind a transparent proxy cache, instituted by their ISP, school, or business network.
2) Other users might have Internet access *only* through a proxy server.
3) Other users might be using a reverse proxy server on their local network as a caching yum mirror.

There are two cases where yum has problems with proxy caches:

1) An RPM package changes content without changing filename. This usually happens only when a package was pushed unsigned and was later signed. A simple workaround within yum is discussed later in this mail.
2) Repodata can become partially out of sync on the cache. This happens because repomd.xml is requested often while the other repodata files are requested less often, so the cache checks repomd.xml for freshness against the origin more often. When repodata changes on the origin, repomd.xml is refreshed on the cache before the other repodata files. yum clients that see the new repomd.xml but the old primary.sqlite.bz2 error out.
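
To make case 2 concrete, here is a minimal sketch (not yum's actual code) of the kind of consistency check that fails when a cache serves a fresh repomd.xml next to a stale repodata file. The XML namespace and element layout are what I believe createrepo emits; make_repomd(), matches_repomd(), and the sample payloads are hypothetical names invented for this illustration.

```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace assumed for createrepo-style repomd.xml files.
REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}


def make_repomd(href, payload):
    """Build a tiny repomd.xml (illustration only) describing one file."""
    digest = hashlib.sha256(payload).hexdigest()
    return (
        '<repomd xmlns="http://linux.duke.edu/metadata/repo">'
        '<data type="primary_db">'
        f'<checksum type="sha256">{digest}</checksum>'
        f'<location href="{href}"/>'
        "</data></repomd>"
    )


def expected_checksums(repomd_xml):
    """Map each repodata href to its (algorithm, hex digest) pair."""
    root = ET.fromstring(repomd_xml)
    result = {}
    for data in root.findall("repo:data", REPO_NS):
        href = data.find("repo:location", REPO_NS).get("href")
        checksum = data.find("repo:checksum", REPO_NS)
        result[href] = (checksum.get("type"), checksum.text.strip())
    return result


def matches_repomd(repomd_xml, href, payload):
    """True when the downloaded payload matches what repomd.xml promises."""
    algo, digest = expected_checksums(repomd_xml)[href]
    return hashlib.new(algo, payload).hexdigest() == digest


if __name__ == "__main__":
    href = "repodata/primary.sqlite.bz2"
    new_db = b"primary database after the repository update"
    old_db = b"primary database before the repository update"
    fresh_repomd = make_repomd(href, new_db)  # cache already refreshed this
    print(matches_repomd(fresh_repomd, href, new_db))  # consistent pair
    print(matches_repomd(fresh_repomd, href, old_db))  # stale cached file
```

The second check is the situation described above: the client holds the new repomd.xml, the cache still serves the old database, and the checksums disagree.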

Ideal Solution for #2 Partial Repodata Sync Problem
===================================================

Henrik highly suggests using versioned repodata files as the ideal solution to this problem. This way caches can serve repodata without fear of the sync problem, and also without querying the origin server upon every client download. repomd.xml would reference repodata files whose names change with each update, perhaps by embedding a timestamp in the filename, e.g. primary-1201140584.sqlite.bz2

This would be an elegant solution, but can we migrate to it, given that older clients wouldn't be able to handle it? I'm guessing not, so here are other less efficient but workable solutions.

"Cache-Control: max-age=0"
==========================

This HTTP header directive can appear in either the request or the response. It instructs the proxy cache server to always query the origin HTTP server to check whether the requested file has changed. The cache compares the origin's reported Last-Modified or ETag to what it holds in its own cache.

This means that each and every request for repodata/* files will trigger a query to the origin server. This is a relatively quick operation and an acceptable compromise if we cannot make repodata filenames versioned.

This HTTP directive can be applied to repodata/* files at three levels:

1) Origin HTTP mirrors can be configured to serve "Cache-Control: max-age=0" in HTTP headers whenever they serve repodata/* files. This can become a standard recommendation for all Fedora mirrors. Does anyone know how to configure Apache to do this?
2) Squid's refresh_pattern can use a regex to force max-age=0 behavior for repodata/* files. I haven't figured out exactly what the syntax is for this. Anybody know squid.conf?
3) yum can always include the HTTP directive in its requests for repodata/* files. Can we make this the default in future versions of yum? I personally don't see a drawback (unless repodata becomes versioned, in which case we don't want this).
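
As a rough illustration of level 3, here is a minimal Python sketch of a client that adds the directive only to repodata/* requests. It uses the standard library's urllib rather than yum's URLGrabber, and the function names and mirror.example.com URLs are hypothetical, so treat it as a sketch of the idea and not as a patch for yum.

```python
import urllib.request
from urllib.parse import urlparse


def is_repodata_url(url):
    """True when the URL path contains a repodata/ directory component."""
    return "/repodata/" in urlparse(url).path


def headers_for(url):
    """Extra request headers: force revalidation only for repodata files."""
    if is_repodata_url(url):
        return {"Cache-Control": "max-age=0"}
    return {}


def build_request(url):
    """Prepare a request; pass the result to urllib.request.urlopen()."""
    return urllib.request.Request(url, headers=headers_for(url))


if __name__ == "__main__":
    for url in (
        "http://mirror.example.com/fedora/repodata/repomd.xml",
        "http://mirror.example.com/fedora/Packages/foo-1.0-1.noarch.rpm",
    ):
        print(url, headers_for(url))
```

RPM downloads are left alone on purpose, so caches can still serve large packages without contacting the origin on every request.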

Yum and "X-Cache: HIT"
======================

If you run wget --server-response against a target URL, you see the raw HTTP headers of the response. If the file is already cached, you see an HTTP header like the one below:

    X-Cache: HIT from proxyserver.example.com

Proposal: Improve yum with the following download logic:

    IF (a downloaded repodata/* file doesn't match the repomd.xml checksum
        OR a downloaded RPM doesn't match the expected checksum)
       AND "X-Cache: HIT from" was in its HTTP headers
    THEN download it again with the URLGrabber option:
        http_headers = (('Pragma', 'no-cache'),)

This should solve the case where RPM files legitimately change contents without changing filenames, as with RPM signing. It also correctly does NOT trigger additional downloads upon other errors, such as files corrupted in transit that did not come from a cache.

Squid Configuration Suggestion
==============================

collapsed_forwarding

This option is not enabled by default, but it is pretty useful for proxy servers of our type. When multiple clients ask for the same file that is not yet in the server's cache, this option makes them wait on the same origin download connection instead of spawning more downloads of the same thing.

Upstream Squid on Adapting Squid's Storage Engine
=================================================

<hno> Squid currently has an abstract index keyed on integers which makes this hard to implement, but we are planning to break that out from Squid allowing the cache to structure itself in whatever manner, with one possible approach to use store the URLs as-is in the filesystem (within certain bounds)
<hno> Apache do not have this same abstract internal layer, and writing a mod_disk_cache replacement which keeps a mirror type file structure should be pretty easy thing to do.
<adri> Its at least on my draft squid-2 roadmap for ~6 months from now
<adri> Since its going to be important for people running squid in environments where high numbers of lookups for large objects isn't required, but they want to cache gigabytes/terabytes of $LARGE objects
<adri> without having huge amounts of RAM involved

So unfortunately Squid currently cannot be adapted to be the perfect InstantMirror, but we might be able to achieve it quickly by adapting Apache's mod_disk_cache. Has anybody touched this part of Apache before?

Warren Togami
wtogami@xxxxxxxxxx