If you are not using GlusterFS NFS services, have you tried Linux tc or
traffic isolation? If you put traffic control on inter-node traffic,
that will limit the rebalance/self-heal I/O. Alternatively, you can
move inter-node traffic onto its own network interface, using routing
and /etc/hosts entries. (Minimal sketches of both approaches are at the
end of this mail.)

I would expect that one of the difficulties in controlling the
rebalance/self-heal I/O at the glusterfsd level is hooking into the
kernel for traffic information on the interface it is routing through.
Since both activities are push-based, the receiver needs to know its
full I/O picture via the kernel and push back accordingly. You don't
want a static setting, as that would throttle rebalance/self-heal too
low at idle times and not back off enough at high-load times. So it
needs to be dynamic, based on "available" I/O. This is definitely a
"not easy" problem to solve.

On 05/16/13 02:54, Hans Lambermont wrote:
> Hi all,
>
> My production setup also suffers from total unavailability outages
> when self-heal gets real work to do. On a 4-server
> distributed-replicate 14x2 cluster where 1 server has been down for
> 2 days, the volume becomes completely unresponsive when we bring the
> server back into the cluster.
>
> I ticketed it here: https://bugzilla.redhat.com/show_bug.cgi?id=963223
> "Re-inserting a server in a v3.3.2qa2 distributed-replicate volume
> DOSes the volume"
>
> Does anyone know of a way to slow down self-heal so that it does not
> make the volume unresponsive?
>
> The "unavailability due to high load caused by gluster itself"
> pattern repeats itself in several cases:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=950024 replace-brick
> immediately saturates IO on source brick causing the entire volume
> to be unavailable, then dies
>
> https://bugzilla.redhat.com/show_bug.cgi?id=950006 replace-brick
> activity dies, destination glusterfs spins at 100% CPU forever
>
> https://bugzilla.redhat.com/show_bug.cgi?id=832609 Glusterfsd hangs
> if brick filesystem becomes unresponsive, causing all clients to
> lock up
>
> https://bugzilla.redhat.com/show_bug.cgi?id=962875 Entire volume
> DOSes itself when a node reboots and runs fsck on its bricks while
> network is up
>
> https://bugzilla.redhat.com/show_bug.cgi?id=963223 Re-inserting a
> server in a v3.3.2qa2 distributed-replicate volume DOSes the volume
>
> There are probably more, but these are the ones that affected my
> servers.
>
> I also had to stop a rebalance action, because its load on the above
> cluster (3 of 4 servers up) caused another service-unavailability
> outage. This might be related to the 1 server being down, as
> rebalance 'behaved' better before. I have not filed a ticket for
> this yet.
>
> This pattern really must be fixed, sooner rather than later, as it
> makes running a production-level service with gluster impossible.
>
> regards,
>    Hans Lambermont

--
Mr. Flibble
King of the Potato People
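
P.S. A minimal sketch of the static tc approach, which (per the above)
will always be a compromise between idle-time and busy-time behaviour.
All names and numbers here are assumptions for the example: eth0 is a
shared 1 Gbit/s interface, and 10.0.0.2 and 10.0.0.3 are the other
Gluster servers; the 300/700 Mbit/s split is arbitrary. Note that tc
shapes egress only, which suits the push-based rebalance/self-heal
traffic; run it on every node.

  # Root HTB qdisc; unclassified traffic falls into class 1:20.
  tc qdisc add dev eth0 root handle 1: htb default 20
  # Parent class sized to the physical link.
  tc class add dev eth0 parent 1: classid 1:1 htb rate 1gbit
  # Inter-node traffic: guaranteed 300mbit, may borrow up to 400mbit.
  tc class add dev eth0 parent 1:1 classid 1:10 htb rate 300mbit ceil 400mbit
  # Everything else (client traffic): may borrow up to the full link.
  tc class add dev eth0 parent 1:1 classid 1:20 htb rate 700mbit ceil 1gbit
  # Classify anything addressed to the other servers as inter-node
  # traffic (one filter per peer).
  tc filter add dev eth0 protocol ip parent 1: prio 1 u32 \
      match ip dst 10.0.0.2/32 flowid 1:10
  tc filter add dev eth0 protocol ip parent 1: prio 1 u32 \
      match ip dst 10.0.0.3/32 flowid 1:10

Matching on the brick ports instead of peer IPs would avoid shaping the
(small) management traffic as well, but IP matching keeps the example
simple.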
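
For the separate-interface alternative: glusterd reaches its peers by
the names used at "gluster peer probe" time, so pointing those names at
addresses on a dedicated storage NIC in /etc/hosts on every server
moves the inter-node traffic onto that interface, while clients keep
resolving the public addresses. The hostnames and subnet below are,
again, assumptions for the example:

  # /etc/hosts on every server: resolve peer names to the storage NIC
  10.1.0.1   gluster1
  10.1.0.2   gluster2
  10.1.0.3   gluster3
  10.1.0.4   gluster4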