Hello!

2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> Hello all,
>
> I am trying to set up a file replication scheme for our cluster of about
> 2000 nodes. I'm not sure if what I am doing is actually feasible, so
> I'll best just start from the beginning and maybe one of you knows even
> a better way to do this.
>
> Basically we have a farm of 2000 machines which are running a certain
> application that, during start up, reads about 300 MB of data (out of a
> 6 GB repository) of program libraries, geometry data etc., and this 8
> times per node - once per core on every machine. The data is not modified
> by the program, so it can be regarded as read-only. When the application
> is launched, it is launched on all nodes simultaneously, and especially
> now during debugging this is done very often (within minutes).
[...]
> The interconnect between nodes is TCP/IP over Ethernet.

I apologize in advance for not saying much about advanced GlusterFS
setups in this post. :-)

Before trying a multi-level AFR I'd rule out that a basic AFR setup
would not be able to do the job. Try TSTTCPW (The Simplest Thing That
Could Possibly Work) - and do some benchmarks. IMHO, anything faster
than your NFS server would be an improvement.

One setup might be an AFR'd volume on node A and the "B" nodes,
exported to the clients like server-side AFR
(http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication).
Using 20 "B" nodes, each one would have ~100 clients. Re-exporting the
AFR'd GlusterFS volume over NFS would make changes to the client nodes
unnecessary.

<different ideas>

When I read '2000 machines' and 'read only' I thought of this page:
http://wiki.systemimager.org/index.php/BitTorrent#Benchmark

Would it be possible to use some peer-to-peer software to distribute
the program and data files to the local disks?

I don't have any experience with networks of that size, so I did some
calculations using optimistic estimated values:

Given 300 MB of data per core, 8 cores per node, 2000 nodes and one NFS
server over Gigabit Ethernet estimated at a maximum of 100 MB/s, the
data transfer for start-up would take 3 s/core = 24 s/node = 48000 s
total = ~13.3 hours. Is that anywhere near the time it really takes, or
did I misread some information?

With 10 Gigabit Ethernet and an NFS server powerful enough to use it,
that time might be reduced by a factor of 10 to ~1.3 hours.

Using Gigabit Ethernet and running BitTorrent on every node, a 6 GB tar
of the complete repository might be downloaded and unpacked to all the
local disks within less than 2 hours. Using a compressed file might be
a lot faster, depending on the compression ratio.


Harald Stürzebecher
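
P.S.: In case anyone wants to re-check the back-of-the-envelope numbers
above or plug in their own bandwidth figures, here is a small Python
sketch of the same calculation. The 100 MB/s and 1000 MB/s server
throughputs are just the optimistic estimates used above, not
measurements.

    #!/usr/bin/env python
    # Rough start-up transfer estimates for a single central server.
    # All numbers are optimistic guesses, not measurements.

    DATA_PER_CORE_MB = 300      # read by each core at application start-up
    CORES_PER_NODE   = 8
    NODES            = 2000

    def total_hours(server_mb_per_s):
        """Time to push the start-up data to every core through one server."""
        total_mb = DATA_PER_CORE_MB * CORES_PER_NODE * NODES
        return total_mb / server_mb_per_s / 3600.0

    # Single NFS server on Gigabit Ethernet (~100 MB/s)
    print("1 GbE NFS server:  %.1f hours" % total_hours(100.0))   # ~13.3 h

    # Single NFS server on 10 Gigabit Ethernet (~1000 MB/s)
    print("10 GbE NFS server: %.1f hours" % total_hours(1000.0))  # ~1.3 h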