On Tue, 26 Apr 2011 20:44:33 +0100, Sheridan "Dan" Small wrote:
Thanks for your reply Amos.
The tests are a suite of largely accessibility tests, with some
usability tests, for web pages and other documents. Some are based on
open-source software, some on published algorithms, and others (the
problematic ones) are compiled executables. The tests were generally
designed to test a single web page; I am, however, attempting to test
entire large websites, e.g. government websites or the websites of
large organisations. Data is collated from all tests on all web pages
and other resources tested, and is used to generate a report about the
whole website, not just individual pages.
The tests are largely automatic, with some manual configuration of
cookie and form data, etc. They run on a virtual server, which is
terminated after a single job; only the report itself is kept. No
runtime data, including any cache, is retained after that one job.
A website, e.g. that of a news organisation, can change within the
time it takes to run the suite of tests. I want one static snapshot of
each web page, one per URL, to use as a reference, so that different
tests do not end up reporting on different content for the same URL. I
keep a copy of the web pages for reference within the report. (It
would not be appropriate to keep multiple pages with the same URL in
the report.) Some of the tests fetch documents linked to from the page
being tested, so it is not possible to say which test will fetch a
given file first.
Originally I thought of downloading the files once, writing them to
disk, and processing them from the local copy. I even thought of using
HTTrack ( http://www.httrack.com/ ) to create a static copy of the
websites. The problem with both approaches is that I lose the HTTP
header information. The headers are important, as I would like to keep
the test suite generic enough to handle different character encodings
and content languages, and to make sense of response codes. Also, some
tests complain if the header information is missing or incorrect.
So what I really want is a static snapshot of a dynamic website with
correct HTTP header information. I know this is not what Squid was
designed for, but I was hoping it would be possible with Squid.
Thus I thought I could use Squid to cache a static snapshot of the
(dynamic) websites so that all the tests would run on the same
content.
Of secondary importance is that the test suite is cloud-based and the
cloud service provider charges for bandwidth. If I can reduce repeat
requests for the same file, I can keep my costs down.
Hmm, okay. Squid, or any proxy, is not quite the right tool to use for
this. The software is geared around providing the latest, freshly
revalidated version of everything on request.
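If you did force the issue, you would have to override most of that
freshness logic in squid.conf, roughly along the lines of this
untested sketch (the paths, sizes and refresh values are only
placeholders for a 3.x install), and even then responses the origin
marks uncacheable can still slip past it:

  # hypothetical squid.conf fragment: cache aggressively, never revalidate
  http_port 3128
  cache_dir ufs /var/spool/squid 2048 16 256
  maximum_object_size 64 MB
  # never try to revalidate objects that are already in the cache
  offline_mode on
  # treat everything as fresh for up to a week and ignore most cache-control hints
  refresh_pattern . 10080 100% 10080 override-expire override-lastmod ignore-reload ignore-no-store ignore-private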
I think the spider idea was the right track to go down...
We provide a tool called squidclient which is kept relatively simple
and optimized for integration with test suites like this. Its output is
a raw dump of the full HTTP reply headers and body, with configurable
request headers and source IP as well. It can connect directly to any
HTTP service (applet, server or proxy), but does require a proxy to
gateway FTP and other non-HTTP protocols.
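For example, something like this (the host, port and URL are
placeholders) dumps one URL's complete reply, headers and body
together, through a proxy on localhost:

  # stdout is the raw HTTP reply: status line, headers, then the body
  squidclient -h localhost -p 3128 http://www.example.com/ > example.dump

  # extra request headers can be added with -H, e.g. a fixed User-Agent
  squidclient -h localhost -p 3128 -H 'User-Agent: my-test-suite\n' http://www.example.com/ > example.dump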
An alternative I'm nearly as fond of is wget. It can save the headers
to a separate file from the binary object and has many configuration
options for spidering the web.
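For instance (the URL and file names are placeholders), the reply
headers can be logged apart from the downloaded body:

  # -S prints the server's reply headers, -o sends that log to a file, -O saves the body
  wget -S -o headers.log -O page.html http://www.example.com/

  # or spider a site a couple of levels deep, logging headers for every fetch
  wget --recursive --level=2 --page-requisites -S -o crawl-headers.log http://www.example.com/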
curl is another popular one that may be worth looking at, though I'm
not very familiar with it.
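If it is of any use, something along these lines (untested on my part,
and the file names are placeholders) should keep the received headers
and the body in separate files:

  # -D dumps the received headers to one file, -o writes the body to another
  curl -D headers.txt -o body.html http://www.example.com/

  # the same request routed through a local proxy instead of going direct
  curl --proxy http://localhost:3128 -D headers.txt -o body.html http://www.example.com/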
Amos