On 08/02/2016 08:43 PM, Vijay Bellur wrote:
> On 08/01/2016 01:29 AM, Mohammed Rafi K C wrote:
>>
>> On 07/30/2016 10:53 PM, Jay Berkenbilt wrote:
>>> We're using glusterfs in Amazon EC2 and observing certain behavior
>>> involving EBS volumes. The basic situation is that, in some cases,
>>> clients can write data to the file system at a rate such that the
>>> gluster daemon on one or more of the nodes may block in disk wait for
>>> longer than 42 seconds, causing gluster to decide that the brick is
>>> down. In fact, it's not down; it's just slow. I believe it is possible,
>>> by looking at certain system data on the node that hosts the drive, to
>>> tell the difference between being down and working through its queue.
>>>
>>> We are attempting a two-pronged approach to solving this problem:
>>>
>>> 1. We would like to figure out how to tune the system, adjusting
>>> kernel parameters, glusterd settings, or both, to try to avoid getting
>>> the system into the state of having so much data to flush out to disk
>>> that it blocks in disk wait for such a long time.
>>> 2. We would like to see if we can make gluster more intelligent about
>>> responding to the pings so that the client side still gets a response
>>> when the remote side is just behind and not down. Though I do
>>> understand that, in some high-performance environments, one may want
>>> to consider a disk that's not keeping up to have failed, so this may
>>> have to be a tunable parameter.
>>>
>>> We have a small team that has been working on this problem for a
>>> couple of weeks. I just joined the team on Friday. I am new to
>>> gluster, but I am not at all new to low-level system programming,
>>> Linux administration, etc. I'm very much open to the possibility of
>>> digging into the gluster code and supplying patches
>>
>> Welcome to Gluster. It is great to see a lot of ideas within days :).
>>
>>> if we can find a way to adjust the behavior of gluster to make it
>>> behave better under these conditions.
>>>
>>> So, here are my questions:
>>>
>>> * Does anyone have experience with this type of issue who can offer
>>> any suggestions on kernel parameters or gluster configurations we
>>> could play with? We have several kernel parameters in mind and are
>>> starting to measure their effect.
>>> * Does anyone have any background on how we might be able to tell
>>> that the system is getting itself into this state? Again, we have some
>>> ideas on this already, mostly by using sysstat to monitor things,
>>> though ultimately, if we find a reliable way to do it, we'd probably
>>> code it directly by looking at the relevant entries in /proc from our
>>> own code. I don't have the details with me right now.
>>> * Can someone provide any pointers to where in the gluster code the
>>> ping logic is handled and/or how one might go about making it a little
>>> smarter?
>>
>> One of our users had a similar problem where ping packets were queued
>> on the waiting list because of heavy traffic. I have a patch which
>> tries to solve the issue: http://review.gluster.org/#/c/11935/ . It is
>> under review and might need some more work, but I guess it is worth
>> trying.
>
> Would it be possible to rebase this patch against the latest master? I
> am interested to see if we still see the pre-commit regression failures.

I will do that shortly.

Rafi KC

> Thanks!
> Vijay
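
A note for anyone landing on this thread later, on the first prong
(tuning so the node never accumulates 42+ seconds of flush work): the
thread does not name the kernel parameters being evaluated, but the
vm.dirty_* writeback sysctls are the usual candidates when writers can
dirty the page cache faster than an EBS volume can drain it. Below is a
minimal sketch, assuming those are the knobs in question; the byte
values are placeholders to be sized against the volume's sustained
throughput, not recommendations from this thread.

#!/usr/bin/env python3
"""Cap how much dirty data the page cache may hold so the worst-case
flush (and therefore the longest disk wait) stays bounded.
Values are illustrative placeholders, not recommendations."""

# Hypothetical settings: use absolute byte limits rather than a
# percentage of RAM, so large-memory EC2 instances cannot buffer a
# huge backlog destined for a comparatively slow EBS volume.
SETTINGS = {
    "vm/dirty_background_bytes": 64 * 1024 * 1024,   # start background flushing early
    "vm/dirty_bytes": 256 * 1024 * 1024,             # hard limit before writers block
    "vm/dirty_expire_centisecs": 500,                # flush dirty pages older than 5s
}

def apply(settings):
    for key, value in settings.items():
        path = "/proc/sys/" + key
        with open(path, "w") as f:        # requires root
            f.write(str(value))
        print(key, "=", value)

if __name__ == "__main__":
    apply(SETTINGS)

The same values could live in sysctl.conf to persist across reboots;
the only point is that bounding dirty bytes bounds the worst-case
flush, which in turn bounds how long a brick can sit in disk wait.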
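
On the "how do we tell slow from down" question: the numbers the
original poster wants are already exposed in /proc, so a watchdog can
be prototyped without sysstat. Here is a minimal sketch along those
lines, assuming the brick sits on a device named xvdf (hypothetical)
and that the thresholds are arbitrary placeholders.

#!/usr/bin/env python3
"""Distinguish 'busy flushing' from 'dead' on a brick host by watching
dirty/writeback pages and per-device I/O time. The device name and
thresholds are assumptions for illustration only."""
import time

DEVICE = "xvdf"   # hypothetical EBS device backing the brick

def meminfo_kb(field):
    # /proc/meminfo lines look like "Dirty:   123456 kB"
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    return 0

def io_ticks_ms(device):
    # In /proc/diskstats, the 10th stat field after the device name is
    # the total time (ms) the device has spent doing I/O.
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[12])
    return 0

def busy_fraction(device, interval=5.0):
    before = io_ticks_ms(device)
    time.sleep(interval)
    after = io_ticks_ms(device)
    return (after - before) / (interval * 1000.0)

if __name__ == "__main__":
    while True:
        dirty_kb = meminfo_kb("Dirty") + meminfo_kb("Writeback")
        busy = busy_fraction(DEVICE)
        # Near-100% device utilisation plus a large dirty backlog means
        # the node is working through its queue, not down.
        if busy > 0.95 and dirty_kb > 512 * 1024:
            print("%s saturated: busy=%.0f%%, dirty=%d kB"
                  % (DEVICE, busy * 100, dirty_kb))

Separately, the 42-second figure in the original report is the default
for the network.ping-timeout volume option; raising it with
"gluster volume set <volname> network.ping-timeout <seconds>" is a
blunt stopgap (with the failover-latency trade-off the poster already
notes) while the queuing fix in http://review.gluster.org/#/c/11935/
is under review.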