Re: I would like to use Squid for caching but it is imperative that all files be cached.

Thanks for your reply, Amos.

The tests are a suite of mostly accessibility tests, with some usability tests, for web pages and other documents. Some are based on open source software, some on published algorithms, and others (the problematic ones) are compiled executables. The tests were generally designed to test a single web page. I am, however, attempting to test entire large websites, e.g. government websites or the websites of large organisations. Data is collated from all tests on all web pages and other resources tested, and that data is used to generate a report about the whole website, not just individual pages.

The tests are largely automatic, with some manual configuration of cookie and form data etc. They run on a virtual server that is terminated after one job; only the report itself is kept. No runtime data, including any cache, is retained after that single job.

A website, e.g. that of a news organisation, can change within the time it takes to run the suite of tests. I want one static snapshot of each web page, one per URL, to use as a reference, so that different tests do not report on different content for the same URL. I keep a copy of the web pages for reference within the report. (It would not be appropriate to keep multiple pages with the same URL in the report.) Some of the tests fetch documents linked from the page being tested, so it is not possible to say which test will fetch a given file first.

Originally I thought of downloading the files once, writing them to disk, and processing them from the local copy. I even thought of using HTTrack ( http://www.httrack.com/ ) to create a static copy of the websites. The problem with both approaches is that I lose the HTTP header information. The headers matter because I would like to keep the test suite generic enough to handle different character encodings and content languages, and to make sense of response codes. Also, some tests complain if the header information is missing or incorrect.
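
To make that concrete, the local-copy approach I had in mind looked roughly like the sketch below (untested Python using the requests library; the .headers sidecar file is purely my own convention). It also shows why the approach falls short: the headers end up in a sidecar file that every test would have to be taught to read, instead of arriving as part of a real HTTP response.

    import hashlib
    import json
    import os

    import requests  # assumption: using the requests library for the fetch

    def fetch_and_store(url, outdir):
        # Download the URL once and keep the body on disk.
        resp = requests.get(url, timeout=30)
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()
        os.makedirs(outdir, exist_ok=True)
        with open(os.path.join(outdir, name), "wb") as body_file:
            body_file.write(resp.content)
        # Keep the status and headers in a JSON sidecar (my own layout,
        # not anything the tests understand natively).
        with open(os.path.join(outdir, name + ".headers"), "w") as hdr_file:
            json.dump({"url": url,
                       "status": resp.status_code,
                       "headers": dict(resp.headers)}, hdr_file, indent=2)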

So what I really want is a static snapshot of a dynamic website, with correct HTTP header information. I know this is not what Squid was designed for, but I was hoping it would be possible with Squid: I could use Squid to cache a static snapshot of the (dynamic) websites so that all the tests run against the same content.
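
For what it is worth, the sort of configuration I imagined is along these lines (an untested sketch; I am not certain these directives behave the way I hope, and I believe options such as ignore-no-cache and ignore-must-revalidate were removed in newer Squid releases):

    # squid.conf sketch: try to cache everything for the lifetime of one job.
    http_port 3128
    cache_dir ufs /var/spool/squid 2048 16 256
    maximum_object_size 100 MB

    # Treat every response as fresh for a week (far longer than a test run)
    # and ignore the origin server's attempts to forbid or expire caching.
    refresh_pattern . 10080 100% 10080 override-expire override-lastmod ignore-reload ignore-no-store ignore-private

I could then watch access.log for TCP_HIT versus TCP_MISS entries to confirm that repeat requests for the same URL are being served from the cache rather than refetched.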

Of secondary importance: the test suite is cloud based, and the cloud service provider charges for bandwidth. If I can avoid repeat requests for the same file, I can keep my costs down.


