Thanks, it's always good to know I'm not alone with this problem! Also good
to know I haven't missed something blindingly obvious in the config/setup.

We had our VPN drop between the DCs yesterday afternoon, which resulted in
high load on one gluster server at a time for about 10 minutes once the VPN
was back up, so unless anyone else has any ideas, I think looking at
alternatives is our only way forward. I had a quick look the other day and
Ceph was one of the possibilities that stood out for me.

Thanks.

On 14 May 2013 03:21, Toby Corkindale <toby.corkindale at strategicdata.com.au> wrote:

> On 11/05/13 00:40, Matthew Day wrote:
>
>> Hi all,
>>
>> I'm pretty new to Gluster, and the company I work for uses it for
>> storage across 2 data centres. An issue has cropped up fairly recently
>> with regards to the self-heal mechanism.
>>
>> Occasionally the connection between these 2 Gluster servers breaks or
>> drops momentarily. Due to the nature of the business it's highly likely
>> that files have been written during this time. When the self-heal daemon
>> runs it notices a discrepancy and gets the volume up to date. The
>> problem we've been seeing is that this appears to cause the CPU load to
>> increase massively on both servers whilst the healing process takes place.
>>
>> After trying to find out if there were any persistent network issues I
>> tried recreating this on a test system and can now reproduce it at will.
>> Our test system setup is made up of 3 VMs: 2 Gluster servers and a
>> client. The process to cause this was:
>>
>> Add an iptables rule to block one of the Gluster servers from being
>> reached by the other server and the client.
>> Create some random files on the client.
>> Flush the iptables rules out so the server is reachable again.
>> Force a self-heal to run.
>> Watch as the load on the Gluster servers goes bananas.
>>
>> The problem with this is that whilst the self-heal happens one of the
>> gluster servers will be inaccessible from the client, meaning no files
>> can be read or written, causing problems for our users.
>>
>> I've been searching for a solution, or at least someone else who has
>> been having the same problem, and not found anything. I don't know if
>> this is a bug or a config issue (see below for config details). I've
>> tried a variety of different options but none of them have had any
>> effect.
>
>
> For what it's worth.. I get this same behaviour, and our gluster servers
> aren't even in separate data centres. It's not always the self-heal daemon
> that triggers it -- sometimes the client gets in first.
>
> Either way -- while recovery occurs, the available i/o to clients drops to
> effectively nothing, and they stall until recovery completes.
>
> I believe this problem is most visible when your architecture contains a
> lot of small files per directory. If you can change your filesystem layout
> to avoid this, then you may not be hit as hard.
> (eg. Take an MD5 hash of the path and filename, then store the file under
> a subdirectory named after the first few characters in the hash. (2 hex
> chars will divide the files-per-directory by 256, three by 4096) eg.
> "folder/file.dat" becomes "66/folder/file.dat")
>
> I've given up on GlusterFS though; have a look at Ceph and RiakCS if your
> systems suit Swift/S3 style storage.
>
> -Toby
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
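For anyone wanting to repeat the test above, the quoted repro steps can be sketched as a dry-run script. The hostname "gluster2", the volume name "testvol", and the client mount path are all hypothetical; swap the echo in run() for real execution (which needs root on the right machines).

```shell
# Dry-run sketch of the quoted repro steps; prints each command
# instead of executing it.
run() { echo "+ $*"; }

SERVER2=gluster2   # hypothetical hostname of the server to cut off
VOL=testvol        # hypothetical replicated volume name

# 1. Block one Gluster server from the other server and the client.
run iptables -A INPUT -s "$SERVER2" -j DROP

# 2. Create some random files on the client's mount of the volume.
run dd if=/dev/urandom of=/mnt/"$VOL"/random.dat bs=1K count=512

# 3. Flush the iptables rules out so the server is reachable again.
run iptables -F INPUT

# 4. Force a self-heal to run.
run gluster volume heal "$VOL" full

# 5. Watch the load on the Gluster servers.
run uptime
```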
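Toby's small-files mitigation (shard files into subdirectories named after the first couple of hex characters of an MD5 of the path) can be sketched like this; "folder/file.dat" is just an example path, and md5sum is assumed to be available.

```shell
# Shard a file path under a 2-hex-char MD5 prefix directory,
# so files spread across up to 256 top-level subdirectories.
path="folder/file.dat"
prefix=$(printf '%s' "$path" | md5sum | cut -c1-2)
mkdir -p "$(dirname "$prefix/$path")"   # creates <prefix>/folder/
echo "$prefix/$path"                    # the sharded location
```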