Frequent glusterd restarts needed to avoid NFS performance degradation

d.a.bretherton at reading.ac.uk (Dan Bretherton) · Wed, 25 Apr 2012 14:34:30 +0100

Dear Brian and Paul,
Thanks for reporting your NFS performance degradation problems; I'm glad 
I'm not the only one who has it.  My 20 node storage cluster has a 
number of fairly standard replicated-distributed volumes; I don't use 
striping.
> I've also been considering writing a cronjob
> to fix this - have you made any progress on this, anything to report?
I made my compute cluster nodes part of the storage cluster a couple of 
months ago as described here:

http://community.gluster.org/a/nfs-performance-with-fuse-client-redundancy/

A few days ago I set up a cron job to restart glusterd on the compute 
nodes every day at about 2 AM.  So far there haven't been any reported 
problems, and long-running jobs have been unaffected.  I thought this 
would be potentially less disruptive than automatically restarting 
glusterd on the storage servers, because those do a lot more than just 
provide NFS.  I have been using the GlusterFS servers to export NFS to 
less important machines, but I now plan to use the compute nodes for all 
NFS exports in order to take advantage of the daily glusterd restart.  
This isn't an ideal situation because the compute nodes get very busy at 
times and tend to suffer more downtime than the storage servers.  I 
thought about having a dedicated compute server just for GlusterFS 
exports, but I don't have enough in the budget for that at the moment.  
My other worry is that other GlusterFS-related processes on the storage 
servers will slow down with use, not just NFS.
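
A minimal sketch of a cron-driven glusterd restart like the one described 
above, for illustration only. It is not the script actually in use on the 
cluster. The `service glusterd restart` command, the `--run` flag, and the 
crontab path in the comments are all assumptions; some systems may need 
`/etc/init.d/glusterd restart` instead.

```python
#!/usr/bin/env python3
"""Restart glusterd from cron (illustrative sketch, not the original script)."""
import subprocess
import sys

# Assumption: glusterd is managed through the `service` wrapper.
# On other systems this might be ["/etc/init.d/glusterd", "restart"].
RESTART_CMD = ["service", "glusterd", "restart"]


def restart_glusterd(cmd=RESTART_CMD, dry_run=False, runner=subprocess.run):
    """Run the restart command and return its exit status (0 on success)."""
    if dry_run:
        # Only report what would be executed; nothing is restarted.
        print("would run:", " ".join(cmd))
        return 0
    result = runner(cmd)
    return result.returncode


def main(argv):
    # Default to a dry run so the script is safe to try by hand.
    # The (hypothetical) --run flag is what the cron entry would pass.
    return restart_glusterd(dry_run="--run" not in argv)


if __name__ == "__main__":
    # Hypothetical crontab entry for a daily 2 AM restart (path is assumed):
    #   0 2 * * * /usr/local/sbin/restart_glusterd.py --run
    status = main(sys.argv[1:])
    if status != 0:
        sys.exit(status)
```

Run without arguments, the script only prints the command it would execute. 
A non-zero exit status from the restart is passed back to cron, which 
normally mails any output to the job's owner, so a failed restart does not 
go unnoticed.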

> What sort of tasks are you using your gluster for?
The compute cluster is mainly used to run various climate- and 
meteorology-related models and associated data analysis and processing 
applications, all reading from and writing to GlusterFS volumes.
> Ours is for a
> render farm, so we see a very large number of mounts/unmounts as render
> nodes mount various parts of the filesystem. I wonder if this has anything
> to do with it; is your use case anything similar?
I don't think our models and applications do a lot of mounting and 
unmounting; volumes usually stay mounted while compute cluster jobs are 
using the data, and there are also quite a lot of interactive shells 
keeping volumes mounted for long periods.

-Dan.

On 04/23/2012 08:00 PM, gluster-users-request at gluster.org wrote:
> Date: Mon, 23 Apr 2012 19:24:14 +0100
> From: Paul Simpson<paul at realisestudio.com>
> Subject: Re: Frequent glusterd restarts needed to
> 	avoid NFS performance degradation
> To: Brian Cipriano<bcipriano at zerovfx.com>
> Cc: gluster-users at gluster.org
>
> just like to add that we sometimes need to restart glusterd on servers too.
>   again - on a renderfarm that hammers our 4 server dist/repl servers
> heavily.
>
> -p
>
>
> On 23 April 2012 15:38, Brian Cipriano<bcipriano at zerovfx.com>  wrote:
>> Hi Dan - I've seen this problem too. I agree with everything you've
>> described - seems to happen more quickly on more heavily used volumes, and
>> a restart fixes it right away. I've also been considering writing a cronjob
>> to fix this - have you made any progress on this, anything to report?
>>
>> I'm running a fairly simple distributed, non-replicated volume across two
>> servers. What sort of tasks are you using your gluster for? Ours is for a
>> render farm, so we see a very large number of mounts/unmounts as render
>> nodes mount various parts of the filesystem. I wonder if this has anything
>> to do with it; is your use case anything similar?
>>
>> - brian
>>
>>
>> On 4/17/12 7:30 PM, Dan Bretherton wrote:
>>> Dear All-
>>> I find that I have to restart glusterd every few days on my servers to
>>> stop NFS performance from becoming unbearably slow.  When the problem
>>> occurs, volumes can take several minutes to mount and there are long delays
>>> responding to "ls".   Mounting from a different server, i.e. one not
>>> normally used for NFS export, results in normal NFS access speeds.  This
>>> doesn't seem to have anything to do with load because it happens whether or
>>> not there is anything running on the compute servers.  Even when the system
>>> is mostly idle there are often a lot of glusterfsd processes running, and
>>> on several of the servers I looked at this evening there is a process
>>> called glusterfs using 100% of one CPU.  I can't find anything unusual in
>>> nfs.log or etc-glusterfs-glusterd.vol.log on the servers affected.
>>> Restarting glusterd seems to stop this strange behaviour and make NFS
>>> access run smoothly again, but this usually only lasts for a day or two.
>>>
>>> This behaviour is not necessarily related to the length of time since
>>> glusterd was started, but has more to do with the amount of work the
>>> GlusterFS processes on each server have to do.  I use a different server to
>>> export each of my 8 different volumes, and the NFS performance degradation
>>> seems to affect the most heavily used volumes more than the others.  I
>>> really need to find a solution to this problem; all I can think of doing is
>>> setting up a cron job on each server to restart glusterd every day, but I
>>> am worried about what side effects that might have.  I am using GlusterFS
>>> version 3.2.5.  All suggestions would be much appreciated.
>>>
>>> Regards,
>>> Dan.