Dave,

If you can set up a farm of squid nodes with a bunch of web servers acting as origin servers, you could put a load balancer appliance (Citrix NetScaler, F5, Cisco, etc.) in front of the squids. Using a URL-hash based algorithm, the balancer would always request the same URL from the same squid, and in case of a crash of that node you would suffer only one cache miss per object while another squid refilled its copy.

Regards,
Pablo
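Purely as an illustration of the URL-hash idea (the node names, the md5 choice, and the re-hash-on-failure step below are assumptions, not the behaviour of any particular appliance), the selection logic amounts to something like this Python sketch:

import hashlib

# Illustrative pool of squid nodes; a real balancer would maintain this
# list from its health checks.
SQUID_NODES = ["squid01:3128", "squid02:3128", "squid03:3128"]

def pick_node(url, nodes):
    # Hash the URL so every request for the same URL goes to the same
    # squid, which keeps its cached copy hot.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

url = "http://origin.example.org/data/calib.db"
primary = pick_node(url, SQUID_NODES)

# If the chosen node crashes, the balancer drops it from the pool and
# re-hashes; the URL then maps to another squid, which pays a single
# cache miss to refetch the object from the origin server.
alive = [n for n in SQUID_NODES if n != primary]
fallback = pick_node(url, alive)
print(primary, fallback)

Squid's own CARP peer selection does a similar hash-based mapping between proxies, and a consistent hash keeps the remapping after a node failure down to roughly that node's share of the URLs.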
On 6/11/07, Dave Dykstra <dwd@xxxxxxxx> wrote:

Hi,

I have been thinking about the problem of quickly distributing objects to thousands of jobs in a compute cluster (for high energy physics). We have multiple applications that need to distribute the same data to many different jobs: some distribute hundreds of megabytes to thousands of jobs, others gigabytes of data to hundreds of jobs.

It quickly becomes impractical to serve all the data from just a few nodes running squid, so I am thinking about running squid on every node, especially as the number of CPU cores per node increases. The problem then is how to determine which peer to get data from. As far as I can tell, none of the methods currently supported by squid would work very well with thousands of squids, especially since a small number of them would often be out of service, which makes static configuration hard. Am I right about that?

It seems to me that it would work better if a couple of nodes could dynamically keep track of which nodes had which objects (over a certain size) and could direct requests to nodes that had the objects or were in the process of getting them. That is quite a bit like the approach peer-to-peer systems such as bittorrent use, although I haven't found any existing implementation appropriate for this application, and I think it is probably more appropriate to extend squid.

- Dave Dykstra
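Just to make that last idea concrete, here is a toy Python sketch (with made-up node names and an arbitrary size threshold; none of this is existing squid code) of the kind of directory node Dave describes: it records which cache nodes hold, or are fetching, large objects and points requesters at a peer.

from collections import defaultdict

class ObjectTracker:
    # Toy in-memory directory, bittorrent-tracker style: it only knows
    # which nodes claim to hold (or be fetching) each large object.
    def __init__(self, size_threshold=100 * 1024 * 1024):
        self.size_threshold = size_threshold   # ignore small objects
        self.holders = defaultdict(set)        # url -> set of node addresses

    def announce(self, url, node, size):
        # A node reports that it has, or has started fetching, an object.
        if size >= self.size_threshold:
            self.holders[url].add(node)

    def withdraw(self, node):
        # Forget a node that has dropped out of service.
        for nodes in self.holders.values():
            nodes.discard(node)

    def locate(self, url, requester):
        # Name a peer to fetch from, or None to fall back to the origin.
        peers = self.holders[url] - {requester}
        return next(iter(peers), None)

tracker = ObjectTracker()
tracker.announce("http://origin.example.org/dataset/calib.db",
                 "node0042:3128", 500 * 1024 * 1024)
peer = tracker.locate("http://origin.example.org/dataset/calib.db",
                      "node0137:3128")
print(peer)   # -> node0042:3128, so node0137 fetches from that peer

A couple of such tracker nodes could be run for redundancy, with each squid announcing large objects as it starts fetching them and falling back to the origin server when the tracker names no peer.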