Hi Robert,

Robert Hajime Lanning wrote on 20130516:
> If you are not using GlusterFS NFS services, have you tried linux tc
> or traffic isolation?

No, but I doubt it will help me, as my issue is not bandwidth related at
all but rather disk IOPS related. The bricks are saturated on IOPS by the
gluster management tools. (This is a many-directories-and-files setup,
not a VM image store or the like.)

> You don't want to make a static setting, as that will limit your
> rebalance/self-heal too low at idle times and won't back off enough
> during high load times. So, you need it to be dynamic based on
> "available" IO.

It would be ideal if we could limit the IOPS generated by self-heal (or
replace-brick or rebalance), maybe even per brick. My disks can take an
additional 100 IOPS for gluster management, but not much more.

> This is definitely a "not easy" problem to solve.

I hope IOPS throttling is somewhat easier.

regards,
   Hans Lambermont

> On 05/16/13 02:54, Hans Lambermont wrote:
>> Hi all,
>>
>> My production setup also suffers from total unavailability outages when
>> self-heal gets real work to do. On a 4-server distributed-replicate 14x2
>> cluster where 1 server has been down for 2 days, the volume becomes
>> completely unresponsive when we bring the server back into the cluster.
>>
>> I ticketed it here: https://bugzilla.redhat.com/show_bug.cgi?id=963223
>> "Re-inserting a server in a v3.3.2qa2 distributed-replicate volume DOSes
>> the volume"
>>
>> Does anyone know of a way to slow down self-heal so that it does not
>> make the volume unresponsive?
>>
>> The "unavailability due to high load caused by gluster itself" pattern
>> repeats itself in several cases:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=950024 replace-brick
>> immediately saturates IO on the source brick, causing the entire volume
>> to be unavailable, then dies
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=950006 replace-brick
>> activity dies, destination glusterfs spins at 100% CPU forever
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=832609 Glusterfsd hangs if
>> the brick filesystem becomes unresponsive, causing all clients to lock up
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=962875 Entire volume DOSes
>> itself when a node reboots and runs fsck on its bricks while the network
>> is up
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=963223 Re-inserting a server
>> in a v3.3.2qa2 distributed-replicate volume DOSes the volume
>>
>> There are probably more, but these are the ones that affected my servers.
>>
>> I also had to stop a rebalance action on the above 3-out-of-4-servers
>> cluster because its load caused another service unavailability outage.
>> This might be related to 1 server being down, as rebalance 'behaved'
>> better before. I have not filed a ticket for this yet.
>>
>> This pattern really must be fixed, sooner rather than later, as it makes
>> running a production-level service on gluster impossible.
>>
>> regards,
>> Hans Lambermont
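
P.S. For reference, a rough sketch of how one could watch per-disk IOPS to
confirm the bricks are IOPS-bound rather than bandwidth-bound. It just
samples completed reads+writes per second from /proc/diskstats; the device
names sdb/sdc are only placeholders for the actual brick disks, and
'iostat -x 1' shows the same numbers plus %util:

    #!/usr/bin/env python
    # Minimal sketch: sample completed reads+writes per second from
    # /proc/diskstats to see whether the brick disks are IOPS-saturated.
    # BRICK_DISKS is a placeholder; substitute the real brick devices.
    import time

    BRICK_DISKS = ("sdb", "sdc")   # assumed brick block devices

    def completed_ios(devices):
        totals = {}
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                # parts[2] = device name, parts[3] = reads completed,
                # parts[7] = writes completed
                if parts[2] in devices:
                    totals[parts[2]] = int(parts[3]) + int(parts[7])
        return totals

    prev = completed_ios(BRICK_DISKS)
    while True:
        time.sleep(1)
        cur = completed_ios(BRICK_DISKS)
        for dev in sorted(cur):
            print("%s: %d IOPS" % (dev, cur[dev] - prev.get(dev, 0)))
        prev = cur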
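
On the throttling idea itself: until gluster grows such a knob, one possible
stopgap is to cap the self-heal daemon from the outside with the cgroup-v1
blkio throttle, at roughly the 100 extra IOPS mentioned above. This is only
a sketch: the device major:minor numbers, the IOPS budget and the pidfile
path are assumptions, and the blkio throttle only reliably limits
synchronous/direct I/O, so it is not a complete answer:

    #!/usr/bin/env python
    # Sketch of a possible stopgap: put glustershd in a blkio cgroup and cap
    # it at ~100 read and ~100 write IOPS per brick device (cgroup v1).
    # Device numbers, the IOPS budget and the pidfile path are assumptions;
    # adjust them to the actual setup.
    import os

    CGROUP = "/sys/fs/cgroup/blkio/glustershd"
    BRICK_DEVICES = ("8:16", "8:32")   # assumed major:minor of the brick disks
    IOPS_LIMIT = 100                   # the extra IOPS the disks can spare

    if not os.path.isdir(CGROUP):
        os.makedirs(CGROUP)

    for dev in BRICK_DEVICES:
        for knob in ("blkio.throttle.read_iops_device",
                     "blkio.throttle.write_iops_device"):
            with open(os.path.join(CGROUP, knob), "w") as f:
                f.write("%s %d\n" % (dev, IOPS_LIMIT))

    # Move the self-heal daemon into the cgroup; the pidfile location below
    # is an assumption, the PID can also be read off 'gluster volume status'.
    with open("/var/lib/glusterd/glustershd/run/glustershd.pid") as f:
        shd_pid = f.read().strip()
    with open(os.path.join(CGROUP, "tasks"), "w") as f:
        f.write(shd_pid)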