Gluster crashes when cascading AFR

rainer.schwemmer at cern.ch (Rainer Schwemmer) · Tue, 16 Dec 2008 15:18:26 +0100

Hello all,

Thanks for all the suggestions so far.

We did consider BitTorrent. Unfortunately it takes quite a lot of time
to index the whole repository, and it makes changing single files rather
cumbersome when people need to apply a quick fix to a certain component
in the repository.

I did try the replication just to the 50 nodes of the first tier and
this seems to work fairly well. If I can't get the two-tiered setup
to work, this will be the solution I'll go for.

There seems to be a bit of confusion about what I'm trying to do, or I
did not understand the caching part that some of you are suggesting.

The plan is to use AFR to write a copy of the repository onto the local
disks of each of the 2000 cluster nodes. Since Gluster uses the
underlying ext3 file system and just puts the AFRed files onto the
disks, I should be able to read the repository data directly via ext3 on
the cluster nodes once replication is completed.
This way I can also use the Linux built-in FS cache. I would just use
the root node of the hierarchy to throw in new files to be replicated on
all the cluster nodes when necessary.
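
A minimal sketch of how one could check that a node's local copy is
complete before reading it directly via ext3. It assumes the backend
directory holds the replicated files as plain files with the same
relative paths as on the reference copy. The function names and the
example paths are hypothetical and not part of GlusterFS.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file's path relative to `root` to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            sha = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large library files need not fit in RAM.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    sha.update(chunk)
            digests[os.path.relpath(path, root)] = sha.hexdigest()
    return digests


def replica_differences(reference_root, replica_root):
    """Return the sorted relative paths that are missing or differ in content."""
    reference = tree_digests(reference_root)
    replica = tree_digests(replica_root)
    all_paths = reference.keys() | replica.keys()
    return sorted(p for p in all_paths if reference.get(p) != replica.get(p))


# Hypothetical usage on a cluster node (both paths are assumptions):
#   replica_differences("/mnt/gluster/repo", "/data/export/repo")
# An empty list means the local ext3 copy matches the reference copy.
```

Hashing the whole 6 GB tree on every node is expensive, so a check like
this would more likely be run after an update than before every start.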

Cheers,
  Rainer

On Mon, 2008-12-15 at 14:09 -0800, Keith Freedman wrote:
> here are my thoughts.
> 
> the value in having multiple tiers of AFR'ed 
> nodes is simply to help aggregate bandwidth.
> instead of having 2000 nodes fighting for data 
> over the same network and disk, you're distributing that across multiple nodes.
> 
> So I think your architecture is valid.  You could 
> use client side caching as has been suggested to 
> increase performance on the clients so they're 
> reading from local disk instead of over the network all the time.
> 
> As I understand it, there are still going to be 
> some file access performance issues.
> Your client requests a file from its server.
> That server checks with the other 49 in its AFR 
> config to ensure it's got the latest version of the file (or it auto-heals).
> Since this is actually a filesystem that it's 
> getting from another server, that server also 
> checks with its 49 peers.  This should be fine, 
> although what I'm not clear on is whether or not
> all 49 in the first tier are going to cause the 
> same 49 checks.  My guess is some of them will 
> be cached since they'll be redundant, but I've no idea how long this will take.
> 
> perhaps one of the devs can chime in on this aspect.
> 
> I'd think that since you're in a primarily read 
> environment, you'll ultimately still benefit over 
> a single NFS server: once the afr 
> checks/auto-heal is done, you have fewer clients 
> competing for bandwidth, so things will ultimately 
> be much faster. If you're monitoring your port 
> traffic on the clients, you may just see longer 
> delays before the data starts moving, but in the 
> long run (file request to file delivery time) you'll be better off.
> 
> 
> 
> At 01:24 PM 12/15/2008, Harald Stürzebecher wrote:
> >Hello!
> >
> >2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> > > Hello all,
> > >
> > > I am trying to set up a file replication scheme for our cluster of about
> > > 2000 nodes. I'm not sure if what i am doing is actually feasible, so
> > > I'll best just start from the beginning and maybe one of you knows even
> > > a better way to do this.
> > >
> > > Basically we have a farm of 2000 machines which are running a certain
> > > application that, during start up, reads about 300 MB of data (out of a
> > > 6 GB repository) of program libraries, geometry data etc and this 8
> > > times per node. Once per core on every machine. The data is not modified
> > > by the program so it can be regarded as read only. When the application
> > > is launched it is launched on all nodes simultaneously and especially
> > > now during debugging this is done very often (within minutes).
> >
> >[...]
> >
> > > The interconnect between nodes is TCP/IP over Ethernet.
> >
> >I apologize in advance for not saying much about advanced GlusterFS
> >setups in this post. :-)
> >
> >Before trying a multi-level AFR I'd first check whether a basic AFR
> >setup is able to do the job. Try TSTTCPW (The Simplest Thing That
> >Could Possibly Work) - and do some benchmarks. IMHO, anything faster
> >than your NFS server would be an improvement.
> >
> >One setup might be an AFR'd volume on node A and nodes B, exported
> >to the clients as a server-side AFR.
> >(http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication)
> >Using 20 nodes "B", each one would have ~100 clients.
> >
> >Reexporting the AFR'd GlusterFS volume over NFS would make changes to
> >the client nodes unnecessary.
> >
> ><different ideas>
> >
> >When I read '2000 machines' and 'read only' I thought of this page:
> >
> >http://wiki.systemimager.org/index.php/BitTorrent#Benchmark
> >
> >Would it be possible to use some peer-to-peer software to distribute
> >the program and data files to the local disks?
> >
> >
> >I don't have any experience with networks of that size so I did some
> >calculations using optimistic estimated values:
> >Given 300MB data/core, 8 cores per node, 2000 nodes and one NFS server
> >over Gigabit Ethernet estimated at a maximum of 100MB/s, the data
> >transfer for start up would take 3s/core = 24s/node = 48000s total =
> >~13.3 hours.
> >Is that anywhere near the time it really takes or did I misread some
> >information?
> >
> >With 10 Gigabit Ethernet and an NFS server powerful enough to use it,
> >that time might be reduced by a factor of 10 to ~1.3 hours.
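
The estimate above can be reproduced with a few lines of Python. This is
a rough sketch under the same optimistic assumptions (every link fully
saturated, no protocol overhead, every core transferring its own 300MB).
The `servers` parameter is an addition for illustration: it models the
load being split evenly over several serving nodes, each with its own
link.

```python
def startup_transfer_seconds(nodes=2000, cores_per_node=8, mb_per_core=300,
                             link_mb_per_s=100, servers=1):
    """Ideal time to push all start-up data through `servers` saturated links."""
    total_mb = nodes * cores_per_node * mb_per_core
    return total_mb / (link_mb_per_s * servers)


scenarios = [
    ("one NFS server, Gigabit Ethernet", startup_transfer_seconds()),
    ("one NFS server, 10 Gigabit Ethernet",
     startup_transfer_seconds(link_mb_per_s=1000)),
    ("20 serving nodes, Gigabit Ethernet each",
     startup_transfer_seconds(servers=20)),
]

for label, seconds in scenarios:
    print(f"{label}: {seconds:.0f} s = {seconds / 3600:.1f} h")
```

With 20 serving nodes the ideal figure drops to 2400 s (40 minutes),
which is the bandwidth aggregation argument in numbers. If the page
cache on each node serves the repeated reads, `cores_per_node=1` gives a
lower bound for that case instead.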
> >
> >Using Gigabit Ethernet, running BitTorrent on every node might
> >download a 6GB tar of the complete repository and unpack it to all the
> >local disks in less than 2 hours. Using a compressed file might be
> >a lot faster, depending on compression ratio.
> >
> >Harald Stürzebecher
> >
> >_______________________________________________
> >Gluster-users mailing list
> >Gluster-users at gluster.org
> >http://zresearch.com/cgi-bin/mailman/listinfo/gluster-users
>