
Re: Refresh Pattern useless?


On 29.11.2012 13:31, Joshua B. wrote:
I'm beginning to conclude that refresh pattern in Squid is useless.
I had a neat refresh pattern which is supposed to help cache just about everything, below:

refresh_pattern ([^.]+\.)?(download|(windows)?update)\.(microsoft\.)?com/.*\.(cab|exe|msi|msp|psf) 4320 100% 43200 override-expire reload-into-ims ignore-reload ignore-no-cache ignore-private ignore-auth ignore-no-store
refresh_pattern ([^.]+.|)(download|adcdownload).(apple.|)com/.*\.(pkg|dmg) 4320 100% 43200 override-expire reload-into-ims ignore-reload ignore-no-cache ignore-private ignore-auth ignore-no-store

The above pattern matches:
 .*download.com/.*\.(pkg|dmg)

There are no limits on where in the URL that string may occur ... <img src="http://example.com?downloadXcom/.pkg" /> ... ouch.
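To spell out the match (my breakdown of the second pattern above):

  # (download|adcdownload).(apple.|)com/.*\.(pkg|dmg)
  # against: http://example.com?downloadXcom/.pkg
  #   'download' + 'X' (the unescaped '.' matches any char) + '' (empty
  #   apple alternative) + 'com/' + '' + '.pkg'  ->  MATCH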


refresh_pattern ([^.]+.|)avg.com/.*\.(bin) 4320 100% 43200 reload-into-ims
refresh_pattern ([^.]+.|)spywareblaster.net/.*\.(dtb) 4320 100% 64800 reload-into-ims
refresh_pattern ([^.]+.|)symantecliveupdate.com/.*\.(zip|exe) 43200 100% 43200 reload-into-ims
refresh_pattern ([^.]+.|)avast.com/.*\.(vpu|vpaa) 4320 100% 43200 reload-into-ims
refresh_pattern (avgate|avira).*(idx|gz)$ 1440 999999% 10080 ignore-no-cache ignore-no-store ignore-reload reload-into-ims
refresh_pattern kaspersky.*\.avc$ 1440 999999% 10080 ignore-no-cache ignore-no-store ignore-reload reload-into-ims

Problem #1:
There are limits on how large those numbers can be. Newer Squid versions check for integer overflow when 999999/100 is multiplied by the age in seconds of years-old objects, and truncate the result to one year of forward storage. Older Squid would just let the overflow happen and store objects for negative amounts of time ... and negative values mean stale/discard/MISS in the refresh calculations.
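As a sketch of safer numbers (mine, pick your own to taste), a bounded percent keeps the calculation well inside Squid's limits:

  refresh_pattern (avgate|avira).*(idx|gz)$ 1440 100% 10080 reload-into-ims
  # min 1440 = 1 day, percent 100% of the object's observed age,
  # max 10080 = 1 week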


refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(iso|avi|wav|mp3|mp4|mpeg|swf|flv|x-flv)$ 43200 90% 432000 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(deb|rpm|exe|zip|tar|tgz|ram|rar|bin|ppt|doc|tiff)$ 10080 90% 43200 override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.index.(html|htm)$ 0 40% 10080
refresh_pattern -i \.(html|htm|css|js)$ 1440 40% 40320

----

And it barely cached any of the content it was supposed to. I never once saw "TCP_HIT" in the logs. And it seems like when I removed these refresh patterns (leaving the defaults), I finally saw TCP_HITs in the log file...
So is refresh pattern useless? Or am I just doing this wrong??!!

No and maybe.

Overall it is a good idea NOT to use refresh_pattern unless you have to. And definitely NOT to use the ignore/override options unless you have a very specific reason for each one, with some good research to back up why you need it. Sites change over time and so does Squid behaviour...
 For example:
Facebook used to be very cache-unfriendly, so people would force caching to stop images and posts being downloaded repeatedly by every user. Since a year or so ago FB has supplied proper cache controls to roll out their scrolling updates live and to allow safe caching of images and timeline pages - all those patterns forcing long-term caching of FB responses now only screw up users' sidebars with stale live-feed content and cause user-A to download user-B's email contact lists etc. on the widget exporter APIs.
 For another example:
squid-2.x and 3.x up to 3.1 are HTTP/1.0 and handle "no-cache" parameters according to HTTP/1.0. But squid-3.2 is HTTP/1.1, where "no-cache" means subtly different things. "ignore-no-cache" will now *reduce* the HIT ratio in a lot of traffic cases... http://squidproxy.wordpress.com/2012/10/16/squid-3-2-pragma-cache-control-no-cache-versus-storage/
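A concrete sketch of that shift (lines are mine, purely illustrative):

  # squid-3.1 (HTTP/1.0): "no-cache" blocked storage, so people forced it:
  refresh_pattern -i \.(html|htm)$ 1440 40% 40320 ignore-no-cache
  # squid-3.2 (HTTP/1.1): "no-cache" responses are stored and revalidated
  # anyway, so the plain line HITs more often without the option:
  refresh_pattern -i \.(html|htm)$ 1440 40% 40320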


Problem #2:
Objects in HTTP/1.* are supposed to be delivered with instructions from the server about their existence, lifetime, storage ability, etc.

refresh_pattern is only designed to be the *backup* which tells Squid some parameters when they are not supplied by the server: the min/max ages to store things for, and at what % of an object's lifetime to start testing with the server whether it is still fresh. It has been hacked about with ignore-*/override-* options to make the algorithm pretend that certain details were never supplied even if they were, or to outright replace the server's details in the traffic with something of your own.
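To make the min/percent/max mechanics concrete, a worked example with my own numbers:

  # refresh_pattern <regex> <min> <percent> <max>
  refresh_pattern -i \.jpg$ 0 20% 4320
  # A response arrives with no Expires/Cache-Control. It was stored 2 hours
  # ago (age = 120 min) and its Last-Modified was 20 hours before fetch
  # (lm-age = 1200 min). Heuristic lifetime = 20% of 1200 = 240 min.
  #   age (120) > max (4320)?  no  -> not forced stale
  #   age (120) < heuristic lifetime (240)?  yes -> FRESH, served as a HIT
  #   (age < min would also force fresh, but min is 0 here)
  # From 240 min onward the object is stale and Squid revalidates (IMS).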

Quite nasty in the ways they interact, and VERY easy to get wrong when fiddling with some other developer's website. For example: by using "ignore-private ignore-auth" you have declared that you know better than any of the developers at Microsoft or Apple whether they will ever send confidential information in private or authenticated traffic to certain domains. That MIGHT be right, but only they know 100% what will be delivered marked 'private', so how can you be that sure? Risky.



Anyhow, back to ...

Problem #3:
Modern websites have a lot of dynamic content. If you check your logs, I think you will find some traffic behaviours are very common.

* how many requests can you actually find which ask for index.* instead of just for "blahblah/" (note the '/' at the end)? In an ideal world they would redirect, but it is more efficient to simply supply the page, and servers tend to do that.

* how many requests just fetch .JS / .CSS / .HTML with no '?' and parameters? The file-type patterns above assume (by ending in 'blah$') that parameters are never sent.

... I think you will find that most of your requests are using these modern URL design techniques. So some pattern changes you WILL need (a combined before/after sketch follows the list):

1) remove those ([^.]+.|) groups at the start of your updater-service patterns. Or better: replace them with the more easily understood ^[a-z]+://

2) also in those per-domain patterns replace .com and .net with \.com and \.net to ensure the '.' is matched properly.

3) replace the $ with (\?.*)?$ at the end of your file-type patterns. This will allow them to actually match files with parameters passed to the server. Note that caching things when the server is known to be dynamic and is not supplying cacheability limits is an HTTP violation, and can cause some sites (like facebook and docs.google) which serve per-user script files from shared API URLs to screw up.
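Putting all three fixes together on one of the AV patterns (my rewrite, untested - adjust to taste):

  # before: unanchored, unescaped dots, no query strings matched
  refresh_pattern ([^.]+.|)avg.com/.*\.(bin) 4320 100% 43200 reload-into-ims
  # after: scheme-anchored, literal dots, optional '?params' tail
  refresh_pattern ^[a-z]+://([^.]+\.)?avg\.com/.*\.bin(\?.*)?$ 4320 100% 43200 reload-into-ims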


I want to be able to cache Windows Updates, Apple Updates and possibly Linux repositories as well (without some other fancy program for that). I also want to be able to cache various anti-virus vendors' sites, so updating virus signatures is a lot faster. And to be able to cache generic content such as images, media files and software.


Windows Updates is a difficult subject: http://wiki.squid-cache.org/SquidFaq/WindowsUpdate

Apple Updates I don't know much about. Either they supply accurate and useful caching controls or they don't; if they do, absolutely DO NOT use refresh_pattern to mangle them up. If they are like WU and doing weird stuff, some close inspection is needed, and a report of the details found would be very useful to a lot of people here ;-)

AV vendors tend to be cache-friendly but do supply short lifetimes on their data. When you think about it, that is a GOOD thing: it is useless having your clients do an AV update daily and get last year's definitions file. If anything, use refresh_pattern to *shorten* long lifetimes on their packages, but normally you would not want that.

The same logic from AV goes for any updater service: caching the large files is usually fine, but do be very careful about the ages and privacy details. Some downloads are signed with per-user security keys, so sending such a file to other users from the cache will only result in 'corrupt' downloads and huge amounts of wasted traffic.



I want to be able to do this all in Squid... But it seems useless....

Unless someone else has other suggestions? The only thing I'm seeing TCP_HITs on is video content, as I'm using VideoCache. But this isn't enough... I want just about everything to be cached. My proxy server is a dedicated proxy system with 1 TB of hard drive space for caching.

My squid configuration can be viewed here: http://pastebin.com/XW5yZmvk

Problem #4:
Non-refresh_pattern controls on caching. As you can see from the WindowsUpdate wiki page, there are a bunch of other directives directly affecting cache usage in Squid.
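The usual suspects look something like this (values illustrative only - check the wiki page for the reasoning and for restricting them per-domain with ACLs):

  # large service packs only get cached if they fit under this limit:
  maximum_object_size 200 MB
  # fetch whole objects even when clients request byte ranges (WU does):
  range_offset_limit -1
  # keep fetching aborted downloads so they complete into the cache:
  quick_abort_min -1 KB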

Amos

