Re: Accelerating proxy not matching cgi files

On 23/08/11 07:43, Mateusz Buc wrote:
Hello,

at the beginning I would like to mention that I've already searched for
the answer to my question and found similar topics, but none of them
completely solved my problem.

The thing is, I have a monitoring server with a CGI-scripted site on it.
The site fetches various data and generates charts on the fly. Currently
it is only available via HTTPS with htaccess-type authorization.

The bad thing is that it is browsed quite often, and every time it gets
HTTP requests it has to generate all of the charts (quite a lot of
them) on the fly, which not only makes loading the page slow, but also
affects the server's performance.

The 4 most important things about the site are:
* index.cgi - checks current timestamps and generates proper GET
requests to generate images via gen.cgi

You have an internal part of your site performing GET requests?

Or did you mean it generates an index page containing a set of volatile URLs for IMG or A tags?

* gen.cgi - it receives parameters via GET from index.cgi and draws charts
* images ARE NOT files placed on the server, but take the form of gen.cgi links
(e.g. "gen.cgi?icon,moni_sys_procs,1314022200,1,161.6,166.4,FFFFFF...")

Does not matter to Squid where the files are. Or even that they are files.

* image generation links contain the most up-to-date timestamp for
each image

Ouch. Bad, bad, bad for caching. Caching only works when the URLs are stable, with repeated calls to the same ones.

To get good caching, the URL design should only contain parameters relevant to where the data is sourced or its content structure: only those details which could produce two different objects in two _simultaneous_ parallel requests. Everything else is a potential problem.

The reason you may want times to be in the URL is for a wayback machine, where time is an important static coordinate in the location of something.


Your fix, part 1: sending cache-friendly metadata.

** Send "Cache-Control: must-revalidate" to require all browsers and caches to double-check their content instead of making guesses about storage times.

** Send "Last-Modified: " with HTTP formated timestamp. Correct format is important.

At this point incoming requests will either be requesting brand new content or have an If-Modified-Since: header containing the cached object's Last-Modified: timestamp.

NOTE: You will not _yet_ see any reduction in the 200 requests. Potentially you might actually see an increase as "must-revalidate" causes middleware caches to start working better.


Your fix, part 2: reducing CPU-intensive 200 responses.

It is up to your gen.cgi whether it responds quickly with a simple 304 no-change, or creates a whole new object for a 200 reply. This decision can now be based on the If-* information the clients are sending as well as the URL.

 ** Pull the timestamp from the If-Modified-Since header instead of the URL (see the sketch after this list).
    - If there is no such header, the client is requesting a new graph.
    - If the timestamp matches or is newer than the graph the URL describes, send 304 instead.

** Remove the timestamp completely from your URLs unless you want that wayback ability. In which case you may as well make it visible and easy for people to type in URLs manually for particular fetches.
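
A sketch of that decision in gen.cgi (Python again; graph_last_modified() is a hypothetical stand-in for however your script finds the data's age, everything else is standard CGI/HTTP):

  #!/usr/bin/env python
  # Sketch: answer a cheap 304 when the client's cached copy is current.
  import os
  from email.utils import formatdate, parsedate_to_datetime

  def graph_last_modified():
      # Hypothetical: look up when the data behind this graph last
      # changed, e.g. from the monitoring database.
      return 1314022200  # unix timestamp, illustrative

  mtime = graph_last_modified()
  ims = os.environ.get("HTTP_IF_MODIFIED_SINCE")

  client_time = None
  if ims:
      try:
          client_time = parsedate_to_datetime(ims).timestamp()
      except (TypeError, ValueError):
          pass  # malformed date: fall through to a full 200

  if client_time is not None and client_time >= mtime:
      # Client already has this graph: no drawing, tiny response.
      print("Status: 304 Not Modified")
      print()
  else:
      # Draw the chart and send a full 200 with the same
      # cache-friendly headers as in part 1.
      print("Content-Type: image/png")
      print("Cache-Control: must-revalidate")
      print("Last-Modified: " + formatdate(mtime, usegmt=True))
      print()
      # ... render and write the PNG body ...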

At this point your gen.cgi script starts producing a mix of fast 304 responses amidst the slow 200 ones, and both your bandwidth and CPU graphs should drop.


Your fix, part 3: KISS simplicity

Your URLs should now be changing far less, possibly even to the point that they are completely static. The less URLs change, the better caching efficiency you get.

Your index.cgi can be made simpler now, or possibly replaced with a static page. It only needs to change when the type or location of a graph changes in a way that affects the graph URLs.



As a follow-up you can experiment with the other cache-control headers, such as max-age, to find values (per graph URL) that avoid gen.cgi calls completely for a suitable period.
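
For instance, a graph whose data only updates every 5 minutes could send (value illustrative):

  # Sketch: let caches serve this graph for up to 5 minutes without
  # contacting gen.cgi at all, then force a revalidation.
  print("Cache-Control: max-age=300, must-revalidate")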


If you are able to generate an ETag value and validate it easily without much work (for example, an ETag made from an MD5 hash of the raw un-graphed data file, or a hash of the URL+timestamp), then you should also add ETag and other If-* header support to the scripts. That would allow several more powerful caching features to be used by Squid on top of the simple 304/200 savings, such as partial ranges and compression.
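
A sketch using the first of those options (the data-file path is hypothetical):

  #!/usr/bin/env python
  # Sketch: ETag as an MD5 hash of the raw un-graphed data file.
  import hashlib
  import os

  def compute_etag(data_path):
      # ETag values are quoted strings in HTTP.
      with open(data_path, "rb") as f:
          return '"%s"' % hashlib.md5(f.read()).hexdigest()

  etag = compute_etag("/var/lib/monitor/sys_procs.dat")  # hypothetical path

  # If-None-Match carries the ETag(s) the client has cached.
  if os.environ.get("HTTP_IF_NONE_MATCH") == etag:
      print("Status: 304 Not Modified")
      print("ETag: " + etag)
      print()
  else:
      print("Content-Type: image/png")
      print("ETag: " + etag)
      print()
      # ... render and write the PNG body ...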


What I want to do is set up another server in the middle, which would
run squid and act as a transparent, accelerating proxy. My main
problem is that squid doesn't want to cache anything at all. My goals
are to:

* cache index.cgi for 1 minute at most, since it provides important
data for generating the charts
* somehow cache images generated on the fly for as long as there aren't
new ones in index.cgi (only possible if the timestamp has changed)

To make it simpler to develop, I've temporarily disabled authorization,
so my config looks like this:
#################################################################
http_port 5080 accel defaultsite=xxxx.pl ignore-cc

# HTTP peer
cache_peer 11.11.11.11 parent 5080 0 no-query originserver name=xxxx.pl

hierarchy_stoplist cgi-bin cgi ?

The above config line prevents the cache_peer source being used for URLs containing those strings. You can safely drop the line.


refresh_pattern (\.cgi|\?)    0       0%      0

Okay. Check the case sensitivity of your web server; if it is not case-sensitive you will need to re-add the -i to prevent XSS problems.
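
That is (assuming a case-insensitive web server):

  refresh_pattern -i (\.cgi|\?)    0       0%      0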

refresh_pattern .               0       20%     4320

acl our_sites dstdomain xxxx.pl
http_access allow our_sites
cache_peer_access xxxx.pl allow our_sites
cache_peer_access xxxx.pl deny all
##################################################################

Unfortunately, access.log looks like this:

1314022248.996     66 127.0.0.1 TCP_MISS/200 432 GET
http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png
1314022249.041     65 127.0.0.1 TCP_MISS/200 491 GET
http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png
1314022249.057     65 127.0.0.1 TCP_MISS/200 406 GET
http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png

NP: every unique URL is a different object in HTTP. Cache revalidation cannot compare the object at URL A against the object at URL B. Only the origin can do that sort of thing, and yours always produces a 200 when asked.


Could someone tell me how to configure squid to meet my expectations?

Squid is configured by default to meet your expectations about caching. It just requires sensible cache-friendly output from the server scripts. See above.


Some great tutorials on URL design and working with caching can be found at:
  http://warpspire.com/posts/url-design/
  http://www.mnot.net/cache_docs/

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10

