On 29.11.2012 13:31, Joshua B. wrote:
I'm beginning to conclude that refresh pattern in Squid is useless.
I had a neat refresh pattern which is supposed to help cache just
about everything, below:
refresh_pattern
([^.]+\.)?(download|(windows)?update)\.(microsoft\.)?com/.*\.(cab|exe|msi|msp|psf)
4320 100% 43200 override-expire reload-into-ims ignore-reload
ignore-no-cache ignore-private ignore-auth ignore-no-store
refresh_pattern
([^.]+.|)(download|adcdownload).(apple.|)com/.*\.(pkg|dmg) 4320 100%
43200 override-expire reload-into-ims ignore-reload ignore-no-cache
ignore-private ignore-auth ignore-no-store
The above pattern effectively matches:
.*download.com/.*\.(pkg|dmg)
With no anchors there is no limit on where in the URL that string may
occur, and the unescaped '.' matches any character ... <img
src="http://example.com?downloadXcom/.pkg" /> ... ouch.
refresh_pattern ([^.]+.|)avg.com/.*\.(bin) 4320 100% 43200
reload-into-ims
refresh_pattern ([^.]+.|)spywareblaster.net/.*\.(dtb) 4320 100%
64800 reload-into-ims
refresh_pattern ([^.]+.|)symantecliveupdate.com/.*\.(zip|exe) 43200
100% 43200 reload-into-ims
refresh_pattern ([^.]+.|)avast.com/.*\.(vpu|vpaa) 4320 100% 43200
reload-into-ims
refresh_pattern (avgate|avira).*(idx|gz)$
1440 999999% 10080 ignore-no-cache ignore-no-store ignore-reload
reload-into-ims
refresh_pattern kaspersky.*\.avc$
1440 999999% 10080 ignore-no-cache ignore-no-store ignore-reload
reload-into-ims
Problem #1:
There are limits on how large the numbers can be. Newer Squid
versions check for integer overflow when 999999/100 is multiplied by the
age in seconds of years-old objects, and truncate the result to 1 year of
forward storage. Older Squid would just let you store for negative
amounts of time if the overflow went badly ... and a negative value
means stale/discard/MISS in the refresh calculations.
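To put numbers on it (a rough illustration, assuming the values are
held in 32-bit signed integers, as on many builds): an object
Last-Modified 3 years ago has an age of roughly 94,600,000 seconds. A
percent of 999999 multiplies that by 9999.99, giving about 9.5e11 -
far past the 32-bit maximum of about 2.1e9, so the value wraps around
and can easily land negative.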
refresh_pattern -i \.(gif|png|jpg|jpeg|ico)$ 10080 90% 43200
override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.(iso|avi|wav|mp3|mp4|mpeg|swf|flv|x-flv)$ 43200
90% 432000 override-expire ignore-no-cache ignore-no-store
ignore-private
refresh_pattern -i
\.(deb|rpm|exe|zip|tar|tgz|ram|rar|bin|ppt|doc|tiff)$ 10080 90% 43200
override-expire ignore-no-cache ignore-no-store ignore-private
refresh_pattern -i \.index.(html|htm)$ 0 40% 10080
refresh_pattern -i \.(html|htm|css|js)$ 1440 40% 40320
----
And it barely cached any of the content it's supposed to. I never
once saw "TCP_HIT" in the logs.
And it seems like when I removed these refresh patterns (leaving the
defaults), I finally saw TCP_HIT's in the log file...
So is refresh pattern useless? Or am I just doing this wrong??!!
No and maybe.
Overall it is a good idea NOT to use refresh_pattern unless you have
to. And definitely NOT to use the ignore/override options unless you
have a very specific reason for each one, with some good research to back
up why you need it. Sites change over time and so does Squid
behaviour...
For example;
Facebook used to be very cache-unfriendly, so people would force
caching in order to stop the images and posts being downloaded repeatedly
by every user. Since a year or so ago FB supply proper cache controls to
roll out their scrolling updates live and allow safe caching of images
and timeline pages - all those patterns forcing long-term caching of FB
responses are now only screwing up the users' sidebars with stale
live-feed content and causing user-A to download user-B's email contact
lists etc. on the widget exporter APIs.
For another example;
squid-2.x and 3.x up to 3.1 are HTTP/1.0 and handle "no-cache"
parameters according to HTTP/1.0. But squid-3.2 is HTTP/1.1 where
"no-cache" means subtly different things. "ignore-no-cache" will now
*reduce* the HIT ratio in a lot of traffic cases...
http://squidproxy.wordpress.com/2012/10/16/squid-3-2-pragma-cache-control-no-cache-versus-storage/
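To illustrate the subtlety (a hand-written example, not real Squid
output), consider a response like:

  HTTP/1.1 200 OK
  Cache-Control: no-cache
  ETag: "3e86-410"

Old HTTP/1.0-style handling treated that as "do not store" at all.
Under HTTP/1.1 it means "store, but revalidate before each reuse", so
squid-3.2 can keep the object and answer with a cheap
If-None-Match/304 exchange instead of a full re-download.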
Problem #2:
Objects in HTTP/1.* are supposed to be delivered with instructions
from the server about their existence, lifetime, storage ability
etc.
refresh_pattern is only designed to be the *backup* which gives Squid
some parameters when they are not supplied by the server: the min/max
ages to store things for, and at what % of an object's lifetime to start
testing with the server whether it is still fresh or not.
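A worked example of that backup algorithm (simplified from the
description in squid.conf.documented), for a response carrying no
explicit expiry and matched by:

  refresh_pattern . 0 20% 4320

If the object was Last-Modified 10 days before Squid fetched it, its
LM-factor is the age-since-fetch divided by those 10 days. For the
first 2 days (20% of 10 days) requests are FRESH and served as HITs;
after that Squid revalidates with the origin. The min of 0 adds no
unconditional freshness, and the max of 4320 minutes (3 days) would
cap the fresh period in any case.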
It has been hacked about with ignore-*/override-* options to make the
algorithm pretend that certain details were never supplied even if they
were, or to outright replace the server's details in the traffic with
something of your own.
Quite nasty in the ways they interact, and VERY easy to get wrong when
fiddling with some other developer's website. For example; by using
"ignore-private ignore-auth" you have declared that you know better than
any of the developers at Microsoft or Apple whether they will ever be
sending confidential information in private or authenticated traffic to
certain domains. That MIGHT be right, but only they actually know 100%
of what will be delivered marked 'private', so how can you be that sure?
Risky.
Anyhow, back to ...
Problem #3:
Modern websites have a lot of dynamic content. If you check your logs
I think you will find some traffic behaviours are very common:
* how many requests can you actually find which ask for
index.* instead of just for "blahblah/" (note the '/' at the end)? In an
ideal world servers would redirect, but it is more efficient to simply
supply the page and they tend to do that.
* how many requests just fetch .JS / .CSS / .HTML with no '?' and
parameters? The patterns for file types above assume (using 'blah$')
that parameters are never sent.
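A quick way to eyeball both of those (a sketch; it assumes the default
native access.log format, where the URL is field 7, and the usual log
location):

  awk '{print $7}' /var/log/squid/access.log | grep -c '/$'      # "blahblah/" requests
  awk '{print $7}' /var/log/squid/access.log | grep -c 'index\.' # explicit index.* requests
  awk '{print $7}' /var/log/squid/access.log | grep -c '?'       # URLs carrying parameters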
... I think you will find that most of your requests are using these
modern URL design techniques. So some pattern changes you WILL need:
1) remove those ([^.]+.|) at the start of your updater service
patterns. Or better: replace them with the more easily understood
^[a-z]+://
2) also in those per-domain patterns, replace .com and .net with \.com
and \.net so the '.' is matched as a literal dot.
3) replace the $ with (\?.*)?$ at the end of your file-type patterns.
This will allow them to actually match files with parameters passed to
the server. Note that caching things when the server is known to be
dynamic and is not supplying cacheability limits is an HTTP violation,
and can cause some sites (like Facebook and docs.google) which present
per-user script files from shared API URLs to screw up. A combined
before/after sketch of all three changes follows below.
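Putting the three changes together on the Apple pattern and one
file-type pattern from above (a sketch only - timing values kept from
the original, host pinned to apple.com, ignore-*/override-* options
left out per the earlier advice; test before trusting it):

  # changes 1+2: anchor on the scheme, escape the literal dots
  refresh_pattern ^[a-z]+://(download|adcdownload)\.apple\.com/.*\.(pkg|dmg) 4320 100% 43200 reload-into-ims
  # change 3: let the file-type pattern match URLs with query parameters
  refresh_pattern -i \.(gif|png|jpg|jpeg|ico)(\?.*)?$ 10080 90% 43200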
I want to be able to cache Windows Updates, Apple Updates and
possibly Linux repositories as well (without some other fancy program
for that). I also want to be able to cache various Anti-Virus vendors'
sites, so updating virus signatures is a lot faster. And to be able
to cache generic content such as images, media files and software.
Windows Updates is a difficult subject:
http://wiki.squid-cache.org/SquidFaq/WindowsUpdate
Apple Updates I don't know much about. Either they supply accurate and
useful caching controls or they don't; if they do, absolutely DO NOT use
refresh_pattern to mangle them up. If they are like WU and doing weird
stuff, some close inspection is needed, and a report of the details
found would be very useful to a lot of people here ;-)
AV vendors tend to be cache-friendly but do supply short lifetimes on
their data. When you think about it, that is a GOOD thing: it is
useless having your clients do an AV update daily and get last year's
definitions file. If anything, use refresh_pattern to *shorten* long
lifetimes on their packages, though normally you would not want even
that.
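If you do need it, something like this (a sketch; a plain
refresh_pattern such as this only takes effect when the server sends
no explicit expiry information) caps the heuristic lifetime of
matching signature files at a few hours:

  refresh_pattern -i \.(vpu|vpaa|avc)$ 0 20% 240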
The same logic as for AV goes for any updater service: caching the large
files is usually fine, but do be very careful about the ages and privacy
details. Some downloads are signed with per-user security keys, so
sending such a file to other users from the cache will only result in
'corrupt' downloads and huge amounts of traffic wasted.
I want to be able to do this all in Squid... But it seems
useless....
Unless someone else has other suggestions? The only thing I'm seeing
TCP_HIT's on is video content, as I'm using VideoCache. But this isn't
enough... I want just about everything to be cached. My proxy server
is a dedicated proxy system with 1 TB of hard drive space for caching.
My squid configuration can be viewed here:
http://pastebin.com/XW5yZmvk
Problem #4:
non-refresh_pattern controls on caching. As you can see from the
WindowsUpdate wiki page there are a bunch of other controls directly
affecting cache usage in Squid.
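For instance, directives of the kind that page discusses (values here
are illustrative only, not recommendations):

  maximum_object_size 1 GB     # let very large update files into the cache at all
  range_offset_limit -1        # fetch whole objects even when clients ask for Ranges
  quick_abort_min -1 KB        # finish aborted downloads so objects become complete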
Amos