Re: General Squid setup

On 30/08/2012 2:13 a.m., Farkas H wrote:
Hi Amos,
thanks for your response.
My part is the web server in the middle [WS], providing services to
process data. Users send requests to the web server via HTTP POST,
with embedded HTTP GET requests. I don't want to touch this for the moment.

The web server sends the embedded http-get requests to remote servers
(not mine), receives the requested data, processes the data and
returns the result.
I want to cache the data of the remote servers. I think it's necessary
to redirect the HTTP GET output of the web server to Squid. I would
say Squid should be behind the web server and not in front like a
reverse proxy, but I'm not a specialist. What is your opinion? Is there
a chance to do this (without coding)?
I appreciate any advice.
Thanks, Farkas

Squid does not do any semantic re-writing such as you are wanting. That is too dangerous in HTTP, even if you happen to have stumbled on an application which does not break when it's done.

Your proposed configuration #1 should do what you want (#2 will not).
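
For reference, a minimal squid.conf sketch of configuration #1 (Squid as a plain forward proxy with [WS] as its only client) could look like this; the port and the 192.0.2.10 address for [WS] are placeholders, not details from your setup:

  # Squid as a normal forward proxy; [WS] is the only client
  http_port 3128

  # permit only the [WS] machine to use this proxy (placeholder address)
  acl ws_host src 192.0.2.10
  http_access allow ws_host
  http_access deny all

  # on-disk cache for the GET responses fetched from [R1]..[Rn]
  cache_dir ufs /var/spool/squid 1024 16 256

[WS] then needs to send its outgoing GET requests through that proxy port, for example via the http_proxy environment variable or the proxy setting of whatever HTTP client library it uses.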

Alternatively, an ICAP service doing the [WS] re-writing operation and supplying Squid with adapted requests might be another way to achieve this. But I'm not so sure about ICAP being workable either; POST requests are marked non-cacheable by Squid for now.
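
For completeness, the Squid side of attaching such a service is roughly the following (assuming Squid 3.1 or later; the service name ws_rewrite and the icap:// URL are placeholders for whatever adaptation service you would write):

  # hand each request to an external ICAP service before caching decisions
  icap_enable on
  icap_service ws_rewrite reqmod_precache bypass=0 icap://127.0.0.1:1344/rewrite
  adaptation_access ws_rewrite allow all

The ICAP service itself would still have to be coded, which may defeat your "without coding" requirement.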

Amos



On 28 August 2012 12:20, Amos Jeffries <squid3@xxxxxxxxxxxxx> wrote:
On 25/08/2012 8:41 a.m., Farkas H wrote:
Hi list,

I'm a little confused about the various configuration options of
Squid. I have the following setup:
Internet clients <-> remote Web server [WS] <-> different remote Web
servers [R1], ..., [Rn]
[WS] processes the data; [R1], ..., [Rn] provide the data

The clients send requests via HTTP POST to [WS].
[WS] translates the requests and retrieves the required data from
[R1], ..., [Rn] via HTTP GET. [WS] processes the data and sends the
responses to the clients.

The responses of [R1], ..., [Rn] (and possibly the requests of [WS])
should be cached close to [WS].
The number of web servers [R1], ..., [Rn] is relatively small. This
should lead to many cache hits.

Cache HIT ratio is related to the range of the URL space, not to the server
count. For example, Wikipedia has a great many servers all serving the same
content; they sometimes get a HIT ratio near 100%, since the client-requested
URLs are all for the one website and usually for a few "trending" articles.

But since these are "delivery" operations which are being cached and served
from cache, the server will never receive the HITs and will never be able to
update its state according to their receipt. The result can be very broken,
very client-visible behaviours unintended by the site designer(s).


I have two suggestions for discussion:
(1) normal Squid cache; [WS] acts as a kind of client; [WS] is the
only client of Squid Proxy; the requests of [WS] would have to be
redirected programmatically to Squid Proxy,
(2) reverse proxy (with httpd-accelerator mode).

Are these options suitable? Which (other) Squid setup would you recommend?
Is (1) possible without programming?
Which configuration (from http://wiki.squid-cache.org/ConfigExamples)
should be chosen for (1) or (2)?

Do you own those websites, or are you providing CDN services to their owners?
Choose (2); it will pass through the requests unchanged.
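
A minimal accelerator sketch for (2), with placeholder domain and origin address (the wiki ConfigExamples page above has fuller versions):

  # Squid in accelerator (reverse proxy) mode in front of one origin
  http_port 80 accel defaultsite=www.example.com
  cache_peer 192.0.2.20 parent 80 0 no-query originserver name=origin

  # accept and route only requests for the accelerated site
  acl our_site dstdomain www.example.com
  http_access allow our_site
  cache_peer_access origin allow our_site
  http_access deny all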

Are you the ISP for those clients? Choose (1), but...


Are you aware of the difference between HTTP POST and GET semantics, and how
that determines very different caching, security, and failure-recovery
models? Why are you re-writing these critical semantics in a relay?

Amos


