On Tue, 22 Feb 2011 11:26:51 -0500, Charles Galpin wrote:
Hi Amos, thanks so much for the help. More questions and
clarification needed please
On Feb 18, 2011, at 5:47 PM, Amos Jeffries wrote:
Make sure your config has had these changes:
http://wiki.squid-cache.org/ConfigExamples/DynamicContent
which allows Squid to play with query-string (?) objects properly.
Yes these were default settings for me. I don't think this is
necessarily an issue for me though since I am sending URLs that look
like static image requests, but converting them via mod_rewrite in
apache to call my script.
TCP_REFRESH_MISS means the backend sent a new changed copy while
revalidating/refreshing its existing copy.
max-age=0 means revalidate that it has not changed before sending
anything.
> I have set an Expires, Etag, "Cache-Control: max-age=600,
s-max-age=600, must-revalidate", "Content-Length" and ...
must-revalidate from the server is essentially the same as max-age=0
from the client. It will also lead to TCP_REFRESH_MISS.
I'll admit I threw in the must-revalidate as part of my increasingly
desperate attempts to get things behaving the way I wanted, and
didn't fully understand its ramifications, nor the client-side
max-age=0 implications, but your explanation helps!
BUT, these controls are only what is making the problem visible. The
server logic itself is the actual problem.
Agreed!
ETag should be the MD5 checksum of the file or something similarly
unique. It is used alongside the URL to guarantee version differences
are kept separate.
Yes, this was another desperate attempt to force caching to occur,
and I will implement something more sane for the actual app. But this
should have helped, shouldn't it? For my testing it should have
uniquely identified this image, right?
I guess I have a fundamental misunderstanding, but my assumption was
that all these directives were ways to tell squid not to keep asking
the origin, but to serve from the cache until the age expired and at
that point check if it changed. I totally didn't expect it to check
every time, and this still doesn't sit well with me. Should it really
check every time? I know a check is faster than an actual GET but it
still seems more than necessary if caching parameters have been
specified.
Your approach is reasonable for your needs. But the backend server
system is letting you down by sending back a new copy every
validation.
If you can get it to present 304 not-modified responses between file
update times this will work as intended.
This would mean implementing some extra logic in the script to
handle If-Modified-Since, If-Unmodified-Since, If-None-Match and
If-Match headers.
The script itself needs to be in control of whether a local static
duplicate is used, apache does not have enough info to do it as you
noticed. Most CMS call this server-side caching.
Ok, I can return 304 and it gets a cache hit as expected, so this is
great. I'm not sure I'll spend more time making my test script any
smarter, as it's just a simple perl script and the actual
implementation will be in java and able to make these determinations.
But one of the things that has been throwing me off is that I see no
signs in the apache logs of a HEAD request; they all show up as GETs.
I assume this is my mod_rewrite rule, but I also tried with a direct
url to the script and am not getting the If-Modified-Since header,
for example (the only one I know off the top of my head that is set
by the CGI module).
Correct. This is a RESTful property of HTTP.
HEAD is for systems to determine the properties of an object when they
*never* want the body to come back as the reply. Re-validation requests
do want changed bodies to come back when relevant so they use GET with
If-* headers.
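FWIW, since the real implementation will be in java: the If-* handling
only needs to be a few lines. Below is a rough, untested sketch of one
way to do it. The servlet name, image root path and jpeg content type
are placeholders (nothing from your setup), and the size+mtime
validator is just one option; an MD5 of the bytes, as mentioned
earlier, works too.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ImageServlet extends HttpServlet {
    // placeholder location for the static image files
    private static final String IMAGE_ROOT = "/var/www/images";

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String path = req.getPathInfo();
        File file = (path == null) ? null : new File(IMAGE_ROOT, path);
        if (file == null || !file.isFile()) {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND);
            return;
        }

        // Cheap but stable validators: size + mtime. An MD5 of the
        // bytes also works, at the cost of reading the whole file.
        long lastModified = file.lastModified();
        String etag = "\"" + file.length() + "-" + lastModified + "\"";

        // Revalidations arrive as GET with If-* headers, never as HEAD.
        String ifNoneMatch = req.getHeader("If-None-Match");
        long ifModifiedSince = req.getDateHeader("If-Modified-Since");
        boolean etagMatches = etag.equals(ifNoneMatch);
        boolean timeMatches = ifModifiedSince != -1
                && lastModified / 1000 <= ifModifiedSince / 1000;

        if (etagMatches || (ifNoneMatch == null && timeMatches)) {
            // Unchanged: 304 with no body, Squid keeps serving its copy.
            resp.setHeader("ETag", etag);
            resp.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
            return;
        }

        // Changed (or first fetch): full body plus validators and
        // a freshness lifetime so Squid can serve it without asking.
        resp.setHeader("ETag", etag);
        resp.setDateHeader("Last-Modified", lastModified);
        resp.setHeader("Cache-Control", "max-age=600");
        resp.setContentType("image/jpeg");
        resp.setContentLength((int) file.length());

        OutputStream out = resp.getOutputStream();
        FileInputStream in = new FileInputStream(file);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
        }
    }
}

Between file updates the validations should then come back as hits in
access.log instead of TCP_REFRESH_MISS.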
But either way, this confirms it's just my dumb script to blame :)
Cool, good to know it's easily fixed.
Lastly, I was unable to set up squid on an alternate port - say 8081 -
and use an existing apache on port 80, both on the same box. This is
for testing so I can run squid in parallel with the existing service
without changing the port it is on. Squid seems to want to use the
same port for the origin server as itself and I can't figure out how
to say "listen on 8081 but send requests to port 80 of the origin
server". Any thoughts on this? I am using another server right now to
get around this, but it would be more convenient to use the same box.
cache_peer parameter #3 is the port number on the origin server to
send HTTP requests to.
Also, to make the Host: header and URL contain the right port number
when crossing ports like this you need to set the http_port vport=X
option to the port the backend server is using. Otherwise Squid will
place its public-facing port number in the Host: header to inform the
backend what the client's real URL was.
Yes I have this but it's still not working. Below are all uncommented
lines in my squid.conf - can you see anything I have that's messing
this up? The imageserver.my.org is an apache virtual host if it
matters. With this, if I go to
http://imageserver.my.org:8081/my/image/path.jpg , squid calls
http://imageserver.my.org:8081/my/image/path.jpg instead of
http://imageserver.my.org:80/my/image/path.jpg
Hmm, that is a bit of a worry. vport=80 is supposed to be fixing that
port number up so it disappears completely (implicit :80).
acl all src all
acl manager proto cache_object
acl localhost src 127.0.0.1/32
acl to_localhost dst 127.0.0.0/8 0.0.0.0/32
acl http8081 port 8081
acl local-servers dstdomain .my.org
acl localnet src 10.0.0.0/8 # RFC1918 possible internal network
acl localnet src 172.16.0.0/12 # RFC1918 possible internal network
acl localnet src 192.168.0.0/16 # RFC1918 possible internal network
acl SSL_ports port 443
acl Safe_ports port 80 # http
acl Safe_ports port 8081 # http
acl Safe_ports port 21 # ftp
acl Safe_ports port 443 # https
acl Safe_ports port 70 # gopher
acl Safe_ports port 210 # wais
acl Safe_ports port 1025-65535 # unregistered ports
acl Safe_ports port 280 # http-mgmt
acl Safe_ports port 488 # gss-http
acl Safe_ports port 591 # filemaker
acl Safe_ports port 777 # multiling http
acl CONNECT method CONNECT
For a reverse proxy it is a good idea to place http_access and ACL
controls specific to the reverse proxy at this point in the file.
What I would add here for your config is this:
acl imageserver dstdomain imageserver.my.org
http_access allow imageserver
NP: this obsoletes the http8081 limits.
http_access allow manager localhost
http_access deny manager
http_access deny !Safe_ports
http_access deny CONNECT !SSL_ports
http_access allow localnet
http_access allow http8081
http_access deny all
icp_access allow localnet
icp_access deny all
http_port 8081 vhost vport=80 defaultsite=imageserver.my.org
Optional with your Squid, but to future-proof the upgrade you can add
the accel mode flag explicitly, first on the options list:
http_port 8081 accel vhost vport=80 defaultsite=imageserver.my.org
cache_peer imageserver.my.org parent 80 0 no-query originserver default
hierarchy_stoplist cgi-bin ?
access_log c:/squid/var/logs/access.log squid
refresh_pattern ^ftp: 1440 20% 10080
refresh_pattern ^gopher: 1440 0% 1440
refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
refresh_pattern . 0 20% 4320
acl shoutcast rep_header X-HTTP09-First-Line ^ICY.[0-9]
upgrade_http0.9 deny shoutcast
acl apache rep_header Server ^Apache
broken_vary_encoding allow apache
always_direct allow all
always_direct allow local-servers
Absolutely remove the always_direct if you can. The "allow imageserver"
line I recommend above will ensure that the website requests are always
serviced. Squid will pass them on to the cache_peer securely *unless*
always_direct bypasses the peer link.
FWIW: When the cache_peer is configured with a FQDN the IPs are looked
up on every request needing to go there. So a small amount of IP load
balancing and failover happens there already, the same as you get from
going direct based on the vhost name.
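To summarise, the accelerator-specific pieces above come down to just
these lines (with the acl/http_access pair placed up among the other
access controls as noted earlier, and both always_direct lines removed
entirely):

acl imageserver dstdomain imageserver.my.org
http_access allow imageserver

http_port 8081 accel vhost vport=80 defaultsite=imageserver.my.org
cache_peer imageserver.my.org parent 80 0 no-query originserver default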
Amos