On Mon, Jun 25, 2007 at 11:57:21PM +0200, Henrik Nordstrom wrote:
> Mon 2007-06-25 at 15:02 -0500, Dave Dykstra wrote:
...
> > I considered that, but wouldn't multicasted ICP queries tend to get many
> > hundreds of replies (on average, half the total number of squids)?
>
> Right.. so not so good when there are very, very many Squids..
>
> You could modify the ICP code to only respond to HITs on multicast
> queries. This would cut down the number of responses considerably..

That's a good start.  There would need to be a timeout in order to
determine when to go to the origin server (or better, to a master local
squid).  (I've put a rough sketch of the multicast ICP configuration in
a P.S. at the end.)

> Another option is to build a hierarchy, grouping the Squids in smaller
> clusters, with only a selected few managing the wider connections.
>
> It's hard to get this fully dynamic, however.  Some configuration will
> be needed to build the hierarchy.
>
> You'll probably have to extend Squid a bit to get what you want running
> smoothly, with multicast ICP being one possible component to discover
> the nearby nodes and the exchanges between those, but I am not familiar
> with your network topology or how the cluster nodes are connected
> together, so it's just a guess.

I think it is very important, for this to be widely accepted, to keep the
static configuration of all the nodes the same and to have any kind of
hierarchy dynamically discovered.  Maybe it would work to keep track of
the fastest respondents to occasional multicast queries, and of the
transfer rates of the data fetched from them.  Those that are the fastest
would get queried first with unicast ICP (perhaps in parallel), and if
none of them have a hit then a multicast query would be done.  Also,
nodes that are heavily loaded need not reply to multicast queries.

> This kind of setup could also benefit a lot from intra-array CARP.  Once
> the cluster members are known, CARP can be used to route the requests in
> a quite efficient manner if the network is reasonably flat.
>
> If the network is more WAN-like, with significantly different levels of
> connectivity between the nodes, then a more grouped layout may be
> needed, building a hierarchy on top of the network topology.

I expect the network to be very much a LAN, but still to have
significantly different levels of throughput.  For example, a cluster I
know about is planned to have full non-blocking gigabit connectivity
between the 40 nodes on each rack, but only a single gigabit link between
each of the 50 rack switches and a central switch (and each node will be
dual quad-core, for a total of 16,000 cores).  I think all of the nodes
will be on a single IP subnet, with the switches automatically sorting
out the packet routing (although I'm not sure about that).  So I don't
think you could call that reasonably flat, nor do I think it would help
to have a single master for each object that all the nodes would get the
object from (as I understand CARP would do).  Some objects could be
pretty large, say 50MB, and sending such an object from one node to all
the 1999 others would be much too slow, especially if there are several
such objects hosted on nodes in the same rack.  Bittorrent gets around
the large-object problem by splitting objects up into fixed-sized chunks
and loading them out of order, but that's not an option with http.
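Just to put rough numbers on "much too slow" above (back of the envelope,
assuming a node's own rack is reachable through the non-blocking rack
switch and everything else has to cross that one gigabit uplink):

    49 other racks x 40 nodes       = 1960 off-rack copies
    1960 copies x 50 MB             = 98 GB  ~= 784 gigabits
    784 gigabits / 1 gigabit/sec    ~= 13 minutes, and that's the best case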
> Is there some kind of cluster node management, keeping track of what
> nodes exist and mapping this to the network topology?  Or does
> everything need to be discovered on the fly by each node?

I want this to work on a wide diversity of clusters administered by
different people at different universities & labs, so in general it has
to be discovered on the fly.  It so happens that the cluster I was
describing above is special purpose, where every node starts the same
program at the same time, and we're planning on statically configuring
it into a fixed hierarchy of cache_peer parents (probably with each
parent serving about four children).  The networking topology of other
clusters will vary, but I expect that most of them will have similar
limitations in order to keep down the cost of the networking hardware.

> Regards
> Henrik

- Dave
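P.S. To make the ICP side a bit more concrete, here is roughly what I
understand a standard multicast ICP setup looks like in squid.conf (the
group address and hostnames are just placeholders, and I haven't tried
this; the respond-only-on-HITs behavior would still need the code change
you suggested):

    # every node joins the group and answers ICP queries arriving on it
    mcast_groups 239.128.16.128

    # every node sends its ICP queries to the group instead of to each
    # peer individually, and accepts replies from known group members
    cache_peer 239.128.16.128 multicast 3128 3130 ttl=16
    cache_peer node001.example.org sibling 3128 3130 multicast-responder
    cache_peer node002.example.org sibling 3128 3130 multicast-responder

    # fixed limit (msec) on how long to wait for ICP replies before
    # concluding that no peer holds the object
    icp_query_timeout 200

Having to list every potential responder as a multicast-responder sibling
is exactly the kind of per-cluster static configuration I'd like to
avoid, though.

For the statically configured special-purpose cluster, I'm picturing each
child node pointing at its assigned parent with something like this
(hostname made up):

    # forward misses to the designated parent for this group of children
    cache_peer parent007.example.org parent 3128 3130 no-query default
    # never go directly to the origin server; always go through the parent
    never_direct allow all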