Hello,

To make things clear, here is what I did:
- deployed GlusterFS on 2, 4, 8, 16, 32, 64 and 128 nodes
- ran a variant of the MAB benchmark (essentially the compilation of openssl-1.0.0) on 2, 4, 8, 16, 32, 64 and 128 nodes
- used 'pdsh -f 512' to start MAB on all nodes at the same time
- in each experiment, each node ran MAB in its own dedicated directory within the GlusterFS global namespace (e.g. nodeA used <gluster global namespace>/nodeA/<mab files>) to avoid a metadata storm on the parent directory inode
- between experiments, I destroyed and redeployed a completely new GlusterFS setup (and I also wiped everything within each brick, i.e. the exported storage directory)

I then compare the average compilation time against the number of nodes ... and it increases, due to the round-robin scheduler that dispatches files over all the bricks:

   2 nodes : Phase_V(s) avg  249.9332121175
   4 nodes : Phase_V(s) avg  262.808117374
   8 nodes : Phase_V(s) avg  293.572061537875
  16 nodes : Phase_V(s) avg  351.436554833375
  32 nodes : Phase_V(s) avg  546.503069517844
  64 nodes : Phase_V(s) avg 1010.61019479478

(Phase V is the compilation itself; the previous phases are about metadata ops.)

You can also try to compile a Linux kernel yourself, it is pretty much the same workload.

Now regarding the GlusterFS setup: yes, you're right, there is no replication, so this is a simple striping setup (on a per-file basis), i.e. pure distribute.

Each time, I create a GlusterFS volume with a single brick, then I add bricks one by one until I reach the number of nodes, and only then do I start the volume (a rough command sketch is at the end of this mail).

Regarding the 128-brick case: it is when I start the volume that I get a random error telling me that <brickX> does not respond, and the failing brick changes every time I retry to start the volume. So far I have not tested with a number of nodes between 64 and 128.

François

On Friday, June 10, 2011 16:38 CEST, Pavan T C <tcp at gluster.com> wrote:

> On Wednesday 08 June 2011 06:10 PM, Francois THIEBOLT wrote:
> > Hello,
> >
> > I'm driving some experiments on grid'5000 with GlusterFS 3.2 and, as a
> > first point, i've been unable to start a volume featuring 128bricks (64 ok)
> >
> > Then, due to the round-robin scheduler, as the number of nodes increase
> > (every node is also a brick), the performance of an application on an
> > individual node decrease!
>
> I would like to understand what you mean by "increase of nodes". You
> have 64 bricks and each brick also acts as a client. So, where is the
> increase in the number of nodes? Are you referring to the mounts that
> you are doing?
>
> What is your gluster configuration - I mean, is it a distribute only, or
> is it a distributed-replicate setup? [From your command sequence, it
> should be a pure distribute, but I just want to be sure].
>
> What is your application like? Is it mostly I/O intensive? It will help
> if you provide a brief description of typical operations done by your
> application.
>
> How are you measuring the performance? What parameter determines that
> you are experiencing a decrease in performance with increase in the
> number of nodes?
>
> Pavan
>
> > So my question is : how to STOP the round-robin distribution of files
> > over the bricks within a volume ?
> >
> > *** Setup ***
> > - i'm using glusterfs3.2 from source
> > - every node is both a client node and a brick (storage)
> > Commands :
> > - gluster peer probe <each of the 128nodes>
> > - gluster volume create myVolume transport tcp <128 bricks:/storage>
> > - gluster volume start myVolume (fails with 128 bricks!)
> > - mount -t glusterfs ...... on all nodes
> >
> > Feel free to tell me how to improve things
> >
> > François
> >
>
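
P.S. In case the exact sequence helps, here is roughly what one experiment looks like on my side (node names, the -w node list, mount point and paths below are just placeholders, not the real grid'5000 ones):

  # build the volume: a single brick at creation time, then add the remaining bricks one by one
  gluster peer probe node002                            # repeated for node003 ... node064
  gluster volume create myVolume transport tcp node001:/storage
  gluster volume add-brick myVolume node002:/storage    # repeated up to node064:/storage
  gluster volume start myVolume

  # mount the volume everywhere and launch MAB on all nodes at once, each in its own directory
  pdsh -f 512 -w node[001-064] 'mount -t glusterfs node001:/myVolume /mnt/gluster'
  pdsh -f 512 -w node[001-064] 'mkdir -p /mnt/gluster/$(hostname); cd /mnt/gluster/$(hostname); <run MAB / openssl-1.0.0 build>'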