Re: caching data for thousands of nodes in a compute cluster

Dave Dykstra <dwd@xxxxxxxx> · Mon, 25 Jun 2007 15:02:23 -0500

Trying again, having got no response.  Any reaction to my questions?

- Dave

On Tue, Jun 12, 2007 at 11:42:42AM -0500, Dave Dykstra wrote:
> On Tue, Jun 12, 2007 at 12:19:26AM +0200, Henrik Nordstrom wrote:
> > On Mon, 2007-06-11 at 15:17 -0500, Dave Dykstra wrote:
> > 
> > > of jobs.  It quickly becomes impractical to distribute all the data from
> > > just a few nodes running squid, so I am thinking about running squid on
> > > every node, especially as the number of CPU cores per node increases.
> > > The problem then is how to determine which peer to get data from.
> > 
> > Multicast ICP sounds like it could be a reasonable option there.
> > 
> > Regards
> > Henrik
> 
> I considered that, but wouldn't multicasted ICP queries tend to get many
> hundreds of replies (on average, half the total number of squids)?  It
> would only use the first response it got back, but it doesn't seem a very
> efficient use of network or compute resources to throw away all the others.
> Do you know of other people who have used multicast ICP for this type of
> application? 
> 
> The multicast TTL could help a little but probably not much.  I expect
> the servers are usually organized in smaller groups, with better network
> connectivity within each group, but it isn't practical to ask the system
> administrators to tell us which servers are in which group so everything
> has to be automatic.  They're very likely all on the same large subnet
> with the switches sorting out the routing, so it isn't clear that
> anything at squid's level would be able to tell how far away servers are
> other than by small differences in response time, or more likely
> throughput of large transfers.  I also don't think we can really expect
> to know the names of all the peers in order to list them as
> "multicast-responder" peers.
> 
> - Dave
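
For reference, here is a rough sketch of what the multicast ICP setup
discussed above might look like in squid.conf.  The group address, TTL,
and host names are made-up placeholders, not values from any real
cluster.  It illustrates the last point: each peer whose replies should
be accepted needs its own cache_peer line marked multicast-responder.

```
# Hypothetical values throughout: the group address, ttl, and host
# names are assumptions for illustration only.
icp_port 3130

# Join the group so this squid receives multicast ICP queries.
mcast_groups 239.128.16.128

# Send this squid's ICP queries to the group; ttl limits how far they go.
cache_peer 239.128.16.128 multicast 3128 3130 ttl=4

# ICP replies are accepted only from peers that are listed by name.
cache_peer node0001.example.org sibling 3128 3130 multicast-responder
cache_peer node0002.example.org sibling 3128 3130 multicast-responder
# ... one such line for every other node in the cluster
```

With thousands of nodes, that list would have to be generated and kept
current on every node, which is the part that looks impractical.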