On Sunday 12 June 2011 07:00 PM, François Thiebolt wrote:
> Hello,
>
> To make things clear, what I've done is:
> - deploying GlusterFS on 2, 4, 8, 16, 32, 64 and 128 nodes
> - running a variant of the MAB benchmark (it's essentially a compilation of openssl-1.0.0) on 2, 4, 8, 16, 32, 64 and 128 nodes
> - I used 'pdsh -f 512' to start MAB on all nodes at the same time
> - in every experiment, each node ran MAB in a dedicated directory within the GlusterFS global namespace (e.g. nodeA used <gluster global namespace>/nodeA/<mab files>) to avoid a metadata storm on the parent directory inode
> - between experiments, I destroy everything and redeploy a completely new GlusterFS setup (and I also destroy everything within each brick, i.e. the exported storage directory)
>
> I then compare the average compilation time against the number of nodes, and it increases because of the round-robin scheduler that distributes the files over all the bricks:
> 2 nodes  : Phase_V(s) avg 249.9332121175
> 4 nodes  : Phase_V(s) avg 262.808117374
> 8 nodes  : Phase_V(s) avg 293.572061537875
> 16 nodes : Phase_V(s) avg 351.436554833375
> 32 nodes : Phase_V(s) avg 546.503069517844
> 64 nodes : Phase_V(s) avg 1010.61019479478
> (Phase V is the compilation itself; the earlier phases are about metadata operations.)
> You can also try compiling a Linux kernel yourself; it is pretty much the same thing.

Thanks much for your detailed description. Is Phase_V the only phase where you are seeing reduced performance?

Regarding your problem: since you are using the bricks also as clients, you have a NUMA-like scenario. In the case of two bricks (and hence two clients), during compilation ~50% of the files will be local to each client, with minimal latencies, while the other 50% will suffer additional network latencies. As you increase the number of nodes, this asymmetry affects a larger fraction of the files (with 64 bricks, only about 1/64 of the files are local to any given client). So the problem is not really the introduction of more servers, but the degree of asymmetry your application is seeing. Your numbers for 2 nodes might therefore not be a good indicator of the average performance.

Try the same experiment by separating the clients from the servers. If you still see this reverse-linear scaling as you add bricks/clients, we can investigate further.
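To be concrete, something along these lines is what I have in mind. This is only a rough, untested sketch: the hostnames (server01, client[01-64]), the mount point and the run_mab.sh path are placeholders, not names taken from your setup:

    # On each dedicated client node (NOT one of the brick/server nodes),
    # mount the volume from any one of the servers:
    mount -t glusterfs server01:/myVolume /mnt/gluster

    # Start MAB from the client nodes only, each client in its own
    # directory under the volume, just as you do today
    # (run_mab.sh stands in for whatever wrapper you use to launch MAB):
    pdsh -f 512 -w client[01-64] 'mkdir -p /mnt/gluster/$(hostname) && cd /mnt/gluster/$(hostname) && /path/to/run_mab.sh'

This keeps the round-robin distribution of files across the bricks, but every file is now equally "remote" for every client, so the local/remote asymmetry goes away.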
Pavan

>
> Now regarding the GlusterFS setup: yes, you're right, there is no replication, so this is a simple striping (on a per-file basis) setup.
> Each time, I create a glusterfs volume with one brick, then I add bricks (one by one) until I reach the target number of nodes, and after that I start the volume.
> Now regarding the 128-brick case: it is when I start the volume that I get a random error telling me that <brickX> does not respond, and the failing brick changes every time I retry to start the volume.
> So far, I haven't tested with a number of nodes between 64 and 128.
>
> François
>
> On Friday, June 10, 2011 16:38 CEST, Pavan T C <tcp at gluster.com> wrote:
>
>> On Wednesday 08 June 2011 06:10 PM, Francois THIEBOLT wrote:
>>> Hello,
>>>
>>> I'm driving some experiments on grid'5000 with GlusterFS 3.2 and, as a
>>> first point, I've been unable to start a volume featuring 128 bricks (64 is OK).
>>>
>>> Then, due to the round-robin scheduler, as the number of nodes increases
>>> (every node is also a brick), the performance of an application on an
>>> individual node decreases!
>>
>> I would like to understand what you mean by "increase of nodes". You
>> have 64 bricks and each brick also acts as a client. So, where is the
>> increase in the number of nodes? Are you referring to the mounts that
>> you are doing?
>>
>> What is your gluster configuration - I mean, is it a distribute-only setup, or
>> is it a distributed-replicate setup? [From your command sequence, it
>> should be a pure distribute, but I just want to be sure.]
>>
>> What is your application like? Is it mostly I/O intensive? It will help
>> if you provide a brief description of the typical operations done by your
>> application.
>>
>> How are you measuring the performance? What parameter tells you that
>> you are experiencing a decrease in performance with an increase in the
>> number of nodes?
>>
>> Pavan
>>
>>> So my question is: how do I STOP the round-robin distribution of files
>>> over the bricks within a volume?
>>>
>>> *** Setup ***
>>> - I'm using glusterfs 3.2 built from source
>>> - every node is both a client node and a brick (storage)
>>> Commands:
>>> - gluster peer probe <each of the 128 nodes>
>>> - gluster volume create myVolume transport tcp <128 bricks:/storage>
>>> - gluster volume start myVolume (fails with 128 bricks!)
>>> - mount -t glusterfs ...... on all nodes
>>>
>>> Feel free to tell me how to improve things
>>>
>>> François