Dave,

If you can set up a farm of squid nodes with a bunch of web servers acting as origin servers, you could put a load balancer appliance (Citrix NetScaler, F5, Cisco, etc.) in front of the squids. Using a URL-hash based algorithm, the balancer would always request the same URL from the same squid, and in case of a crash of that node you would suffer only one cache miss per object while another squid refilled its copy.

Regards,
Pablo
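Purely as an illustration of the URL-hash idea (the node names, the md5 choice, and the re-hash-on-failure step below are assumptions, not the behaviour of any particular appliance), the selection logic amounts to something like this Python sketch:

import hashlib

# Illustrative pool of squid nodes; a real balancer would maintain this
# list from its health checks.
SQUID_NODES = ["squid01:3128", "squid02:3128", "squid03:3128"]

def pick_node(url, nodes):
    # Hash the URL so every request for the same URL goes to the same
    # squid, which keeps its cached copy hot.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

url = "http://origin.example.org/data/calib.db"
primary = pick_node(url, SQUID_NODES)

# If the chosen node crashes, the balancer drops it from the pool and
# re-hashes; the URL then maps to another squid, which pays a single
# cache miss to refetch the object from the origin server.
alive = [n for n in SQUID_NODES if n != primary]
fallback = pick_node(url, alive)
print(primary, fallback)

Squid's own CARP peer selection does a similar hash-based mapping between proxies, and a consistent hash keeps the remapping after a node failure down to roughly that node's share of the URLs.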
On 6/11/07, Dave Dykstra <dwd@xxxxxxxx> wrote:

Hi,

I have been thinking about the problem of quickly distributing objects to thousands of jobs in a compute cluster (for high energy physics). We have multiple applications that need to distribute the same data to many different jobs: some distribute hundreds of megabytes to thousands of jobs, others gigabytes of data to hundreds of jobs.

It quickly becomes impractical to serve all the data from just a few nodes running squid, so I am thinking about running squid on every node, especially as the number of CPU cores per node increases. The problem then is how to determine which peer to get data from. As far as I can tell, none of the methods currently supported by squid would work very well with thousands of squids, especially since a small number of them would often be out of service, which makes static configuration hard. Am I right about that?

It seems to me that it would work better if a couple of nodes could dynamically keep track of which nodes had which objects (over a certain size) and could direct requests to nodes that had the objects or were in the process of getting them. That is quite a bit like the approach peer-to-peer systems such as bittorrent use, although I haven't found any existing implementation appropriate for this application, and I think it is probably more appropriate to extend squid.

- Dave Dykstra
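Just to make that last idea concrete, here is a toy Python sketch (with made-up node names and an arbitrary size threshold; none of this is existing squid code) of the kind of directory node Dave describes: it records which cache nodes hold, or are fetching, large objects and points requesters at a peer.

from collections import defaultdict

class ObjectTracker:
    # Toy in-memory directory, bittorrent-tracker style: it only knows
    # which nodes claim to hold (or be fetching) each large object.
    def __init__(self, size_threshold=100 * 1024 * 1024):
        self.size_threshold = size_threshold   # ignore small objects
        self.holders = defaultdict(set)        # url -> set of node addresses

    def announce(self, url, node, size):
        # A node reports that it has, or has started fetching, an object.
        if size >= self.size_threshold:
            self.holders[url].add(node)

    def withdraw(self, node):
        # Forget a node that has dropped out of service.
        for nodes in self.holders.values():
            nodes.discard(node)

    def locate(self, url, requester):
        # Name a peer to fetch from, or None to fall back to the origin.
        peers = self.holders[url] - {requester}
        return next(iter(peers), None)

tracker = ObjectTracker()
tracker.announce("http://origin.example.org/dataset/calib.db",
                 "node0042:3128", 500 * 1024 * 1024)
peer = tracker.locate("http://origin.example.org/dataset/calib.db",
                      "node0137:3128")
print(peer)   # -> node0042:3128, so node0137 fetches from that peer

A couple of such tracker nodes could be run for redundancy, with each squid announcing large objects as it starts fetching them and falling back to the origin server when the tracker names no peer.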