Re: Accelerating proxy not matching cgi files

On 23/08/11 07:43, Mateusz Buc wrote:
Hello,

at the beginning I would like to mention that I've already searched for
the answer to my question and found similar topics, but none of them
completely solved my problem.

The thing is, I have a monitoring server with a CGI-scripted site on it.
The site fetches various data and generates charts on the fly. Currently
it is only available via HTTPS with htaccess-type authorization.

The bad thing is that it is browsed quite often, and every time it gets
HTTP requests it has to generate all of the charts (quite a lot of
them) on the fly, which not only makes loading the page slow, but also
affects the server's performance.

The 4 most important things about the site are:
* index.cgi - checks current timestamps and generates proper GET
requests to generate images via gen.cgi

You have an internal part of your site performing GET requests?

Or did you mean it generates an index page containing a set of volatile URLs for IMG or A tags?

* gen.cgi - it receives parameters via GET from index.cgi and draws charts
* images ARE NOT files placed on the server, but take the form of gen.cgi links
(e.g. "gen.cgi?icon,moni_sys_procs,1314022200,1,161.6,166.4,FFFFFF...")

Does not matter to Squid where the files are. Or even that they are files.

* image generation links contain the most up-to-date timestamp for
each image

Ouch. Bad, bad, bad for caching. Caching only works when the URLs are stable, with repeated calls to the same ones.

To get good caching, the URL design should only contain parameters relevant to where the data is sourced or its content structure: only those details which could produce two different objects in two _simultaneous_ parallel requests. Everything else is a potential problem.

The reason you may want times to be in the URL is for a wayback machine, where time is an important static coordinate in the location of something.


Your fix, part 1: sending cache-friendly metadata.

** Send "Cache-Control: must-revalidate" to require all browsers and caches to double-check their content instead of making guesses about storage times.

** Send "Last-Modified: " with HTTP formated timestamp. Correct format is important.

At this point incoming requests will either be requesting brand new content or have an If-Modified-Since: header containing the cached object's Last-Modified: timestamp.

NOTE: You will not _yet_ see any reduction in the 200 requests. Potentially you might actually see an increase as "must-revalidate" causes middleware caches to start working better.


Your fix, part 2: reducing CPU-intensive 200 responses.

It is up to your gen.cgi whether it responds quickly with a simple 304 no-change, or creates a whole new object for a 200 reply. This decision can now be based on the If-* information the clients are sending as well as the URL.

 ** Pull the timestamp from the If-Modified-Since header instead of the URL (see the sketch after this list).
    - If there is no such header, the client is requesting a new graph.
    - If the timestamp matches or is newer than the graph the URL describes, send 304 instead.

** Remove the timestamp completely from your URLs unless you want that wayback ability. In which case you may as well make it visible and easy for people to type in URLs manually for particular fetches.
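
A sketch of that decision in gen.cgi (Python again; graph_last_modified() is a hypothetical stand-in for however your script finds the data's age, everything else is standard CGI/HTTP):

  #!/usr/bin/env python
  # Sketch: answer a cheap 304 when the client's cached copy is current.
  import os
  from email.utils import formatdate, parsedate_to_datetime

  def graph_last_modified():
      # Hypothetical: look up when the data behind this graph last
      # changed, e.g. from the monitoring database.
      return 1314022200  # unix timestamp, illustrative

  mtime = graph_last_modified()
  ims = os.environ.get("HTTP_IF_MODIFIED_SINCE")

  client_time = None
  if ims:
      try:
          client_time = parsedate_to_datetime(ims).timestamp()
      except (TypeError, ValueError):
          pass  # malformed date: fall through to a full 200

  if client_time is not None and client_time >= mtime:
      # Client already has this graph: no drawing, tiny response.
      print("Status: 304 Not Modified")
      print()
  else:
      # Draw the chart and send a full 200 with the same
      # cache-friendly headers as in part 1.
      print("Content-Type: image/png")
      print("Cache-Control: must-revalidate")
      print("Last-Modified: " + formatdate(mtime, usegmt=True))
      print()
      # ... render and write the PNG body ...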

At this point your gen.cgi script starts producing a mix of fast 304 responses amidst the slow 200 ones, and both your bandwidth and CPU graphs should drop.


Your fix, part 3: KISS simplicity

Your URLs should now be changing far less, possibly even to the point that they are completely static. The less URLs change, the better caching efficiency you get.

Your index.cgi can be made simpler now, or possibly replaced with a static page. It only needs to change when the type or location of a graph changes in a way that affects the graph URLs.



As a follow-up you can experiment with the other cache-control headers, such as max-age, to find values (per graph URL) that avoid gen.cgi calls completely for a suitable period.
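
For instance, a graph whose data only updates every 5 minutes could send (value illustrative):

  # Sketch: let caches serve this graph for up to 5 minutes without
  # contacting gen.cgi at all, then force a revalidation.
  print("Cache-Control: max-age=300, must-revalidate")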


If you are able to generate an ETag value and validate it easily without much work (for example, an ETag made from an MD5 hash of the raw un-graphed data file, or a hash of the URL+timestamp), then you should also add ETag and other If-* header support to the scripts. That would allow several more powerful caching features to be used by Squid on top of the simple 304/200 savings, such as partial ranges and compression.
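
A sketch using the first of those options (the data-file path is hypothetical):

  #!/usr/bin/env python
  # Sketch: ETag as an MD5 hash of the raw un-graphed data file.
  import hashlib
  import os

  def compute_etag(data_path):
      # ETag values are quoted strings in HTTP.
      with open(data_path, "rb") as f:
          return '"%s"' % hashlib.md5(f.read()).hexdigest()

  etag = compute_etag("/var/lib/monitor/sys_procs.dat")  # hypothetical path

  # If-None-Match carries the ETag(s) the client has cached.
  if os.environ.get("HTTP_IF_NONE_MATCH") == etag:
      print("Status: 304 Not Modified")
      print("ETag: " + etag)
      print()
  else:
      print("Content-Type: image/png")
      print("ETag: " + etag)
      print()
      # ... render and write the PNG body ...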


What I want to do is set up another server in the middle, which would
run squid and act as a transparent, accelerating proxy. My main
problem is that squid doesn't want to cache anything at all. My goals
are to:

* cache index.cgi for 1 minute at most, since it provides important
data for generating the charts
* somehow cache images generated on the fly for as long as there aren't
new ones in index.cgi (only possible if the timestamp has changed)

To make it simpler to develop, I've temporarily disabled authorization,
so my config looks like this:
#################################################################
http_port 5080 accel defaultsite=xxxx.pl ignore-cc

# HTTP peer
cache_peer 11.11.11.11 parent 5080 0 no-query originserver name=xxxx.pl

hierarchy_stoplist cgi-bin cgi ?

The above config line prevents the cache_peer source being used for URLs containing those strings. You can safely drop the line.


refresh_pattern (\.cgi|\?)    0       0%      0

Okay. Check the case sensitivity of your web server; if it is not case-sensitive you will need to re-add the -i to prevent XSS problems.
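
That is (assuming a case-insensitive web server):

  refresh_pattern -i (\.cgi|\?)    0       0%      0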

refresh_pattern .               0       20%     4320

acl our_sites dstdomain xxxx.pl
http_access allow our_sites
cache_peer_access xxxx.pl allow our_sites
cache_peer_access xxxx.pl deny all
##################################################################

Unfortunately, access.log looks like this:

1314022248.996     66 127.0.0.1 TCP_MISS/200 432 GET
http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png
1314022249.041     65 127.0.0.1 TCP_MISS/200 491 GET
http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png
1314022249.057     65 127.0.0.1 TCP_MISS/200 406 GET
http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png

NP: every unique URL is a different object in HTTP. Cache revalidation cannot compare the object at URL A against the object at URL B. Only the origin can do that sort of thing, and yours always produces a 200 when asked.


Could someone tell me how to configure squid to meet my expectations?

Squid is configured by default to meet your expectations about caching. It just requires sensible cache-friendly output from the server scripts. See above.


Some great tutorials on URL design and working with caching can be found at:
  http://warpspire.com/posts/url-design/
  http://www.mnot.net/cache_docs/

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10

