
Re: Refresh Pattern useless?


On 29.11.2012 13:31, Joshua B. wrote:
I'm beginning to conclude that refresh pattern in Squid is useless.
I had a neat refresh pattern which is supposed to help cache just about everything, below:

refresh_pattern ([^.]+\.)?(download|(windows)?update)\.(microsoft\.)?com/.*\.(cab|exe|msi|msp|psf) 4320 100% 43200 override-expire reload-into-ims ignore-reload ignore-no-cache ignore-private ignore-auth ignore-no-store
refresh_pattern ([^.]+.|)(download|adcdownload).(apple.|)com/.*\.(pkg|dmg) 4320 100% 43200 override-expire reload-into-ims ignore-reload ignore-no-cache ignore-private ignore-auth ignore-no-store

The above pattern matches:
 .*download.com/.*\.(pkg|dmg)

There are no limits on where in the URL that string may occur ... <img src="http://example.com?downloadXcom/.pkg" /> ... ouch.
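To spell out the match (my breakdown of the second pattern above):

  # (download|adcdownload).(apple.|)com/.*\.(pkg|dmg)
  # against: http://example.com?downloadXcom/.pkg
  #   'download' + 'X' (the unescaped '.' matches any char) + '' (empty
  #   apple alternative) + 'com/' + '' + '.pkg'  ->  MATCH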


refresh_pattern ([^.]+.|)avg.com/.*\.(bin) 4320 100% 43200 reload-into-ims
refresh_pattern ([^.]+.|)spywareblaster.net/.*\.(dtb) 4320 100% 64800 reload-into-ims
refresh_pattern ([^.]+.|)symantecliveupdate.com/.*\.(zip|exe) 43200 100% 43200 reload-into-ims
refresh_pattern ([^.]+.|)avast.com/.*\.(vpu|vpaa) 4320 100% 43200 reload-into-ims
refresh_pattern (avgate|avira).*(idx|gz)$ 1440 999999% 10080 ignore-no-cache ignore-no-store ignore-reload reload-into-ims
refresh_pattern kaspersky.*\.avc$ 1440 999999% 10080 ignore-no-cache ignore-no-store ignore-reload reload-into-ims

Problem #1:
There are limits on how large those numbers can be. Newer Squid versions check for integer overflow when 999999/100 is multiplied by the age in seconds of years-old objects, and truncate the result to one year of forward storage. Older Squid would just let the overflow happen and store objects for negative amounts of time ... and negative values mean stale/discard/MISS in the refresh calculations.
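As a sketch of safer numbers (mine, pick your own to taste), a bounded percent keeps the calculation well inside Squid's limits:

  refresh_pattern (avgate|avira).*(idx|gz)$ 1440 100% 10080 reload-into-ims
  # min 1440 = 1 day, percent 100% of the object's observed age,
  # max 10080 = 1 week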


refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(iso|avi|wav|mp3|mp4|mpeg|swf|flv|x-flv)$ 43200 90% 432000 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(deb|rpm|exe|zip|tar|tgz|ram|rar|bin|ppt|doc|tiff)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.index.(html|htm)$ 0 40% 10080
refresh_pattern -i \.(html|htm|css|js)$ 1440 40% 40320

----

And it barely cached any of the content it was supposed to. I never once saw "TCP_HIT" in the logs. And it seems like when I removed these refresh patterns (leaving the defaults), I finally saw TCP_HITs in the log file...
So is refresh pattern useless? Or am I just doing this wrong??!!

No and maybe.

Overall it is a good idea NOT to use refresh_pattern unless you have to. And definitely NOT to use the ignore/override options unless you have a very specific reason for each one, with some good research to back up why you need it. Sites change over time and so does Squid behaviour...
 For example:
Facebook used to be very cache-unfriendly, so people would force caching to stop images and posts being downloaded repeatedly by every user. Since a year or so ago FB has supplied proper cache controls to roll out their scrolling updates live and to allow safe caching of images and timeline pages - all those patterns forcing long-term caching of FB responses now only screw up users' sidebars with stale live-feed content and cause user-A to download user-B's email contact lists etc. on the widget exporter APIs.
 For another example:
squid-2.x and 3.x up to 3.1 are HTTP/1.0 and handle "no-cache" parameters according to HTTP/1.0. But squid-3.2 is HTTP/1.1, where "no-cache" means subtly different things. "ignore-no-cache" will now *reduce* the HIT ratio in a lot of traffic cases... http://squidproxy.wordpress.com/2012/10/16/squid-3-2-pragma-cache-control-no-cache-versus-storage/
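A concrete sketch of that shift (lines are mine, purely illustrative):

  # squid-3.1 (HTTP/1.0): "no-cache" blocked storage, so people forced it:
  refresh_pattern -i \.(html|htm)$ 1440 40% 40320 ignore-no-cache
  # squid-3.2 (HTTP/1.1): "no-cache" responses are stored and revalidated
  # anyway, so the plain line HITs more often without the option:
  refresh_pattern -i \.(html|htm)$ 1440 40% 40320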


Problem #2:
Objects in HTTP/1.* are supposed to be delivered with instructions from the server about their existence, lifetime, storage ability, etc.

refresh_pattern is only designed to be the *backup* which tells Squid some parameters when they are not supplied by the server: the min/max ages to store things for, and at what % of an object's lifetime to start testing with the server whether it is still fresh. It has been hacked about with ignore-*/override-* options to make the algorithm pretend that certain details were never supplied even if they were, or to outright replace the server's details in the traffic with something of your own.
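To make the min/percent/max mechanics concrete, a worked example with my own numbers:

  # refresh_pattern <regex> <min> <percent> <max>
  refresh_pattern -i \.jpg$ 0 20% 4320
  # A response arrives with no Expires/Cache-Control. It was stored 2 hours
  # ago (age = 120 min) and its Last-Modified was 20 hours before fetch
  # (lm-age = 1200 min). Heuristic lifetime = 20% of 1200 = 240 min.
  #   age (120) > max (4320)?  no  -> not forced stale
  #   age (120) < heuristic lifetime (240)?  yes -> FRESH, served as a HIT
  #   (age < min would also force fresh, but min is 0 here)
  # From 240 min onward the object is stale and Squid revalidates (IMS).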

Quite nasty in the ways they interact, and VERY easy to get wrong when fiddling with some other developer's website. For example: by using "ignore-private ignore-auth" you have declared that you know better than any of the developers at Microsoft or Apple whether they will ever send confidential information in private or authenticated traffic to certain domains. That MIGHT be right, but only they know 100% what will be delivered marked 'private', so how can you be that sure? Risky.



Anyhow, back to ...

Problem #3:
Modern websites have a lot of dynamic content. If you check your logs, I think you will find some traffic behaviours are very common.

* how many requests can you actually find which ask for index.* instead of just for "blahblah/" (note the '/' at the end)? In an ideal world they would redirect, but it is more efficient to simply supply the page, and servers tend to do that.

* how many requests just fetch .JS / .CSS / .HTML with no '?' and parameters? The file-type patterns above assume (by ending in 'blah$') that parameters are never sent.

... I think you will find that most of your requests are using these modern URL design techniques. So some pattern changes you WILL need (a combined before/after sketch follows the list):

1) remove those ([^.]+.|) groups at the start of your updater-service patterns. Or better: replace them with the more easily understood ^[a-z]+://

2) also in those per-domain patterns replace .com and .net with \.com and \.net to ensure the '.' is matched properly.

3) replace the $ with (\?.*)?$ at the end of your file-type patterns. This will allow them to actually match files with parameters passed to the server. Note that caching things when the server is known to be dynamic and is not supplying cacheability limits is an HTTP violation, and can cause some sites (like facebook and docs.google) which serve per-user script files from shared API URLs to screw up.
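Putting all three fixes together on one of the AV patterns (my rewrite, untested - adjust to taste):

  # before: unanchored, unescaped dots, no query strings matched
  refresh_pattern ([^.]+.|)avg.com/.*\.(bin) 4320 100% 43200 reload-into-ims
  # after: scheme-anchored, literal dots, optional '?params' tail
  refresh_pattern ^[a-z]+://([^.]+\.)?avg\.com/.*\.bin(\?.*)?$ 4320 100% 43200 reload-into-ims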


I want to be able to cache Windows Updates, Apple Updates and possibly Linux repositories as well (without some other fancy program for that). I also want to be able to cache various anti-virus vendors' sites, so updating virus signatures is a lot faster. And to be able to cache generic content such as images, media files and software.


Windows Updates is a difficult subject: http://wiki.squid-cache.org/SquidFaq/WindowsUpdate

Apple Updates I don't know much about. Either they supply accurate and useful caching controls or they don't; if they do, absolutely DO NOT use refresh_pattern to mangle them up. If they are like WU and doing weird stuff, some close inspection is needed, and a report of the details found would be very useful to a lot of people here ;-)

AV vendors tend to be cache-friendly but do supply short lifetimes on their data. When you think about it, that is a GOOD thing: it is useless having your clients do an AV update daily and get last year's definitions file. If anything, use refresh_pattern to *shorten* long lifetimes on their packages, but normally you would not want that.

The same logic from AV goes for any updater service: caching the large files is usually fine, but do be very careful about the ages and privacy details. Some downloads are signed with per-user security keys, so sending such a file to other users from the cache will only result in 'corrupt' downloads and huge amounts of wasted traffic.



I want to be able to do this all in Squid... But it seems useless....

Unless someone else has other suggestions? The only thing I'm seeing TCP_HITs on is video content, as I'm using VideoCache. But this isn't enough... I want just about everything to be cached. My proxy server is a dedicated proxy system with 1 TB of hard drive space for caching.

My squid configuration can be viewed here: http://pastebin.com/XW5yZmvk

Problem #4:
Non-refresh_pattern controls on caching. As you can see from the WindowsUpdate wiki page, there are a bunch of other directives directly affecting cache usage in Squid.
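The usual suspects look something like this (values illustrative only - check the wiki page for the reasoning and for restricting them per-domain with ACLs):

  # large service packs only get cached if they fit under this limit:
  maximum_object_size 200 MB
  # fetch whole objects even when clients request byte ranges (WU does):
  range_offset_limit -1
  # keep fetching aborted downloads so they complete into the cache:
  quick_abort_min -1 KB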

Amos

