
Re: refresh pattern questions


On 15/07/2013 6:31 a.m., Joshua B. wrote:
> I have some questions related to refresh pattern options.

> First, since "no-cache" now seems ineffective with HTTP/1.1, what would be a possible way to force an object to cache under both HTTP/1.0 and HTTP/1.1? If it's not possible, then are there any plans to implement it in a future version of Squid?

You are talking about "ignore-no-cache"? I'm not sure you understand exactly what it did and what the new Squid does instead.

Simply put:
There is *no* HTTP/1.0 equivalent for "no-cache" on responses. The best one can do is set an Expires header.
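
For illustration only (the dates are placeholders): a response that wants HTTP/1.1 caches to revalidate and wants HTTP/1.0 caches to treat their copy as immediately stale might carry both headers, with Expires set no later than Date:

  HTTP/1.1 200 OK
  Date: Mon, 15 Jul 2013 06:31:00 GMT
  Cache-Control: no-cache
  Expires: Mon, 15 Jul 2013 06:31:00 GMT
  Content-Type: image/jpeg

Expires equal to (or earlier than) Date lets the object be stored but marks it stale straight away, which is as close as HTTP/1.0 gets to "revalidate before reuse".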

Squid-2.6 to 3.1 had some small HTTP/1.1 support but were unable to perform the tricky revalidation required for handling "no-cache" responses properly, so they treated no-cache as if it were "no-store" and prevented caching of those responses.

==> "ignore-no-cache" used to flip that behaviour and cause those responses to be stored. This resulted in a great many objects being cached for long periods and re-sent to clients from outdated cached copies, which could cause big UX problems (thus the warning when it was used).
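
For reference, that old behaviour was switched on per-pattern in squid.conf roughly like this (the pattern and times are only an example; the option no longer exists in 3.2+):

  # Squid-2.6 .. 3.1 only
  refresh_pattern -i \.jpg$ 0 20% 4320 ignore-no-cache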


Squid-3.2 and later have far better HTTP/1.1 support, *including* the ability to revalidate "no-cache" responses properly. So these versions of Squid *do* store responses with "no-cache" by default. They then send an IMS (If-Modified-Since) request to the server to verify the HIT is up to date, resolving all those UX problems. ==> the useful effect of "ignore-no-cache" no longer needs any config option, and the bad side-effects ... do you really want them?
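
Roughly, the revalidation exchange for a stored "no-cache" object looks like this (URL and dates are illustrative):

  Squid -> origin server:
    GET /image.jpg HTTP/1.1
    Host: website.com
    If-Modified-Since: Mon, 15 Jul 2013 06:31:00 GMT

  origin server -> Squid:
    HTTP/1.1 304 Not Modified

  Squid -> client: the cached body, typically logged as TCP_REFRESH_UNMODIFIED rather than a plain TCP_HIT.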


** If you have a server and want the "old" behaviour for no-cache responses, you should already have been using "no-store" instead.

** If you have a server and want the "old" behaviour that "ignore-no-cache" produced, you should not have been sending "no-cache" on responses to begin with.
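
In other words, the fix belongs in the response headers the origin sends. Something along these lines (illustrative values):

  # never store this object in any cache:
  Cache-Control: no-store

  # store it, but revalidate with the origin before every reuse:
  Cache-Control: no-cache

  # store it and reuse it freely for an hour:
  Cache-Control: max-age=3600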


> Secondly, why is there a limit of 1 year on an "override" method? A lot of websites make it such a pain to cache, and some even go as far as (literally) setting the date of their files back to the early 1900s. Doing this makes it feel impossible to cache the object, especially with Squid's own limitation.

To prevent 32-bit overflow in the numerics inside Squid. Going much further out, the number inverts and you end up with objects being evicted from the cache instead of stored. The whole refresh_pattern calculation needs to be upgraded to 64-bit, and the override-* and ignore-* options reviewed as to what they do versus what the HTTP/1.1 spec allows to happen by default (as was just done for no-cache).
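
As an example of where that cap bites, a forced-caching rule cannot usefully go past one year (the min/max columns are in minutes, so one year is 525600). The pattern and percentages here are purely illustrative, not a recommendation:

  refresh_pattern -i \.jpg$ 1440 50% 525600 override-expire override-lastmod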

Have you ever wondered why those websites go to such extreme lengths? Why they care so much about their clients getting recently updated content?

> With all this said, IS there an effective way to cache content when the server doesn't want you to? So there would be, like, a GUARANTEED "tcp_hit" in the log. Even with a ? in the URL of the image, so Squid would consider anything with a ? after it the same image. For example: website.com/image.jpg?1234567890
> It's the exact same image (I've examined all the entries in the logs that look like this), but they're making it hard to cache with the ? in the URL, so I'd like to know if there's a way around this?

1) Remove any squid.conf "QUERY" ACL and related "cache deny" settings which Squid-2.6 and earlier required. That includes the hierarchy_stoplist patterns. These are the usual cause of dynamic content not caching in Squid-2.7+. (The exact lines to look for are shown after this list.)

2) Try out the upcoming 3.4 (3.HEAD right now) Store-ID feature for de-duplicating cache content (a rough helper sketch follows below). In older versions you can also re-write the URL to strip the numerics. In some ways that is safer, as the backend then becomes aware of the alteration and smart ones can take special action to prevent any massive problems if you accidentally collide with a security system (see below).
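
For 1), the lines to hunt down and delete are the old defaults, which looked like this:

  acl QUERY urlpath_regex cgi-bin \?
  cache deny QUERY
  hierarchy_stoplist cgi-bin ?

The modern default which replaces them is simply:

  refresh_pattern -i (/cgi-bin/|\?) 0 0% 0

For 2), as a rough illustration of the Store-ID idea only (not a drop-in helper - check the store_id_program documentation for the exact helper protocol and reply syntax on your Squid version), a helper could map every website.com/image.jpg?<digits> URL onto one internal cache key:

  #!/usr/bin/env python
  # Store-ID helper sketch. Assumes the concurrent helper line format
  # "<channel-ID> <URL> ..."; adjust to whatever your Squid version sends.
  import re, sys

  pattern = re.compile(r'^(http://website\.com/image\.jpg)\?\d+$')

  for line in sys.stdin:
      parts = line.split()
      if len(parts) < 2:
          continue
      channel, url = parts[0], parts[1]
      m = pattern.match(url)
      if m:
          # store/look up this URL under the bare image URL
          sys.stdout.write("%s OK store-id=%s\n" % (channel, m.group(1)))
      else:
          sys.stdout.write("%s ERR\n" % channel)
      sys.stdout.flush()

wired into squid.conf along the lines of:

  store_id_program /usr/local/bin/storeid_strip_query.py
  store_id_children 5

The URL-rewrite approach for older Squid looks much the same, except the helper returns a rewritten URL which is then sent to the origin server, so the backend actually sees the change.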

How do you know that "website.com/image.jpg?1234567890" is not ...
 ... a part of some captcha-style security system?
 ... the background image for a login button which contains the users name?
 ... an image-written bank account number?
 ... an image containing some other private details?
 ... a script with dynamic references to other URLs?

To be sure you don't make that type of mistake, with all the many, many ways of using URLs, you would have to audit *every single link on every single website which your regex pattern matches* ... or do the easy thing and let HTTP caching controls work as they are supposed to work. Send an annoyed email to the site in question requesting that they fix their URL scheme, highlighting that they get *free* bandwidth in exchange for the fix. Sites do change - Facebook is a good case study to point at: as they scaled up they had to fix their cacheability and HTTP/1.1 compliance to stop their costs exploding.

Amos




