Here are my thoughts. The value in having multiple tiers of AFR'ed nodes is simply to aggregate bandwidth: instead of having 2000 nodes fighting for data over the same network and disk, you're distributing that load across multiple servers. So I think your architecture is valid. You could also use client-side caching, as has been suggested, to increase performance on the clients so they're reading from local disk instead of over the network all the time.

As I understand it, there are still going to be some file access performance issues. Your client requests a file from its server; that server checks with the other 49 in its AFR config to ensure it's got the latest version of the file (or it auto-heals). Since what it's serving is actually a filesystem it gets from another server, that server also checks with its 49 peers. This should be fine, although what I'm not clear on is whether or not all 49 in the first tier are going to trigger the same 49 checks in the second. My guess is some of them will be cached since they'll be redundant, but I've no idea how long this will take. Perhaps one of the devs can chime in on this aspect.

Since you're in a primarily read environment, I'd think you'll ultimately still benefit over a single NFS server: once the AFR checks/auto-heal are done, you have fewer clients competing for bandwidth, so things will ultimately be much faster. If you're monitoring port traffic on the clients you may see longer delays before the data starts moving, but in the long run (file request to file delivery time) you'll be better off.

At 01:24 PM 12/15/2008, Harald Stürzebecher wrote:
>Hello!
>
>2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> > Hello all,
> >
> > I am trying to set up a file replication scheme for our cluster of about
> > 2000 nodes. I'm not sure if what I am doing is actually feasible, so
> > I'll best just start from the beginning and maybe one of you knows even
> > a better way to do this.
> >
> > Basically we have a farm of 2000 machines which are running a certain
> > application that, during start up, reads about 300 MB of data (out of a
> > 6 GB repository) of program libraries, geometry data etc., and this 8
> > times per node - once per core on every machine. The data is not modified
> > by the program so it can be regarded as read-only. When the application
> > is launched it is launched on all nodes simultaneously, and especially
> > now during debugging this is done very often (within minutes).
>[...]
> > The interconnect between nodes is TCP/IP over Ethernet.
>
>I apologize in advance for not saying much about advanced GlusterFS
>setups in this post. :-)
>
>Before trying a multi-level AFR I'd rule out that a basic AFR setup
>would not be able to do the job. Try TSTTCPW (The Simplest Thing That
>Could Possibly Work) - and do some benchmarks. IMHO, anything faster
>than your NFS server would be an improvement.
>
>One setup might be an AFR'd volume on node A and nodes "B", exporting
>that to the clients like a server-side AFR.
>(http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication)
>Using 20 nodes "B", each one would have ~100 clients.
>
>Reexporting the AFR'd GlusterFS volume over NFS would make changes to
>the client nodes unnecessary.
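To make that concrete: the setup on the linked wiki page boils down to a volfile along these lines on each "B" node. This is a rough, untested sketch from memory of the 1.3-era docs; the hostnames, paths and volume names are made up, so check the wiki page for the exact option names:

  volume local
    type storage/posix
    option directory /data/export           # B's own copy on local disk
  end-volume

  volume node-a
    type protocol/client
    option transport-type tcp/client
    option remote-host node-a.example.com   # the tier-1 "A" server
    option remote-subvolume local
  end-volume

  volume mirror
    type cluster/afr                        # mirror local disk against A
    subvolumes local node-a
  end-volume

  volume server
    type protocol/server
    option transport-type tcp/server
    option auth.ip.mirror.allow *           # ~100 clients mount "mirror"
    subvolumes mirror
  end-volume

Each client would then mount "mirror" from its assigned B node, or B could reexport it over NFS as suggested below.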
>
><different ideas>
>
>When I read '2000 machines' and 'read only' I thought of this page:
>
>http://wiki.systemimager.org/index.php/BitTorrent#Benchmark
>
>Would it be possible to use some peer-to-peer software to distribute
>the program and data files to the local disks?
>
>I don't have any experience with networks of that size, so I did some
>calculations using optimistic estimated values:
>Given 300 MB data/core, 8 cores per node, 2000 nodes and one NFS server
>over Gigabit Ethernet estimated at a maximum of 100 MB/s, the data
>transfer for start-up would take 3 s/core = 24 s/node = 48,000 s total =
>~13.3 hours.
>Is that anywhere near the time it really takes, or did I misread some
>information?
>
>With 10 Gigabit Ethernet and an NFS server powerful enough to use it,
>that time might be reduced by a factor of 10 to ~1.3 hours.
>
>Using Gigabit Ethernet and running bittorrent on every node might
>download a 6 GB tar of the complete repository and unpack it to all the
>local disks within less than 2 hours. Using a compressed file might be
>a lot faster, depending on compression ratio.
>
>Harald Stürzebecher
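PS: Harald's arithmetic checks out for the serial case:

  300 MB/core / 100 MB/s  = 3 s/core
  3 s/core * 8 cores      = 24 s/node
  24 s/node * 2000 nodes  = 48,000 s = ~13.3 hours (~4.8 TB in total)

And the peer-to-peer idea scales for the same reason the tiered AFR does: every node that has finished downloading can serve the others, so aggregate bandwidth grows with the swarm instead of being pinned to one server's NIC.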