On 08/02/2016 08:22 PM, Jay Berkenbilt wrote:
> So we managed to work around the behavior by setting
>
> sysctl -w vm.dirty_bytes=50000000
> sysctl -w vm.dirty_background_bytes=25000000
>
> In our environment with our specific load testing, this prevents the
> disk flush from taking longer than gluster's timeout and avoids the
> whole problem with gluster timing out. We haven't finished our
> performance testing, but initial results suggest that it is no worse
> than the performance we had with our previous home-grown solution. In
> our previous home-grown solution, we had a fuse layer that was calling
> fsync() on every megabyte written as soon as there were 10 megabytes'
> worth of requests in the queue, which was effectively emulating in user
> code what these kernel parameters do, but with even smaller numbers.

Great to know you managed to work around it.

> Thanks for the note below about the potential patch. I applied this to
> 3.8.1 with the fix based on the code review comment and have that in my
> back pocket in case we need it, but we're going to try with just the
> kernel tuning for now. These parameters are decent for us anyway
> because, for other reasons based on the nature of our application and
> certain customer requirements, we want to keep the amount of dirty data
> really low.
>
> It looks like the code review has been idle for some time. Any reason?
> It looks like a simple and relatively obvious change (not to take
> anything away from it at all, and I really appreciate the pointer). Is
> there anything potentially unsafe about it? For example, are there cases
> where not always appending to the queue could cause damage to data if
> the test wasn't exactly right or wasn't doing exactly what it was
> expecting? If I were to run our load test against the patch, it wouldn't
> catch anything like that, because we don't actually look at the content
> of the data written in our load test.

I believe it will not result in any kind of data loss scenario, but we
can have more discussion and review in this area. I got reviews from
Shyam and will continue the discussion in Gerrit as a patch comment. We
would be very happy to have your valuable suggestions on the patch or on
any other solution.

> In any case, if the kernel tuning
> doesn't completely solve the problem for us, I may pull this out and do
> some more rigorous testing against it. If I do, I can comment on the
> code change.

Great. I will rebase the patch so that, if needed, you can cleanly apply
it to the master code.

> For now, unless I post otherwise, we're considering our specific problem
> to be resolved, though I believe there remains a potential weakness in
> gluster's ability to report that it is still up when one of the nodes
> has a slower disk write speed.
>
> --Jay
>
> On 08/01/2016 01:29 AM, Mohammed Rafi K C wrote:
>> On 07/30/2016 10:53 PM, Jay Berkenbilt wrote:
>>> We're using glusterfs in Amazon EC2 and observing certain behavior
>>> involving EBS volumes. The basic situation is that, in some cases,
>>> clients can write data to the file system at a rate such that the
>>> gluster daemon on one or more of the nodes may block in disk wait for
>>> longer than 42 seconds, causing gluster to decide that the brick is
>>> down. In fact, it's not down; it's just slow. I believe it is possible,
>>> by looking at certain system data on the node with the drive, to tell
>>> the difference between being down and working through its queue.
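(A quick aside on the "certain system data" mentioned just above: the size
of the unflushed backlog on the brick node is visible in /proc/meminfo as
the Dirty and Writeback counters, the same numbers sysstat reads. The
sketch below is only an illustration, not code from this thread; the
200 MB threshold is an arbitrary example value.)

/* Illustration only: read the Dirty and Writeback counters from
 * /proc/meminfo to estimate how much data is still waiting to be
 * flushed to disk. A brick that is "slow, not down" will show a large
 * backlog here while still answering on the network. */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *key)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

int main(void)
{
    long dirty = meminfo_kb("Dirty:");
    long writeback = meminfo_kb("Writeback:");

    if (dirty < 0 || writeback < 0)
        return 1;
    printf("Dirty: %ld kB, Writeback: %ld kB\n", dirty, writeback);
    if (dirty + writeback > 200 * 1024)   /* example threshold: 200 MB */
        printf("Large flush backlog: node is busy, not necessarily down.\n");
    return 0;
}

A node that is merely behind will show a large but draining backlog here
while it keeps responding on the network; a node that is actually down
will simply stop responding.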
>>>
>>> We are attempting a two-pronged approach to solving this problem:
>>>
>>> 1. We would like to figure out how to tune the system, including either
>>> or both of adjusting kernel parameters or glusterd, to try to avoid
>>> getting the system into the state of having so much data to flush out to
>>> disk that it blocks in disk wait for such a long time.
>>> 2. We would like to see if we can make gluster more intelligent about
>>> responding to the pings so that the client side is still getting a
>>> response when the remote side is just behind and not down. Though I do
>>> understand that, in some high-performance environments, one may want to
>>> consider a disk that's not keeping up to have failed, so this may have
>>> to be a tunable parameter.
>>>
>>> We have a small team that has been working on this problem for a couple
>>> of weeks. I just joined the team on Friday. I am new to gluster, but I
>>> am not at all new to low-level system programming, Linux administration,
>>> etc. I'm very much open to the possibility of digging into the gluster
>>> code and supplying patches
>> Welcome to Gluster. It is great to see a lot of ideas within days :).
>>
>>> if we can find a way to adjust the behavior
>>> of gluster to make it behave better under these conditions.
>>>
>>> So, here are my questions:
>>>
>>> * Does anyone have experience with this type of issue who can offer any
>>> suggestions on kernel parameters or gluster configurations we could play
>>> with? We have several kernel parameters in mind and are starting to
>>> measure their effect.
>>> * Does anyone have any background on how we might be able to tell that
>>> the system is getting itself into this state? Again, we have some ideas
>>> on this already, mostly by using sysstat to monitor stuff, though
>>> ultimately, if we find a reliable way to do it, we'd probably code it
>>> directly by looking at the relevant stuff in /proc from our own code. I
>>> don't have the details with me right now.
>>> * Can someone provide any pointers to where in the gluster code the ping
>>> logic is handled and/or how one might go about making it a little smarter?
>> One user had a similar problem where ping packets were queued on the
>> waiting list because of heavy traffic. I have a patch that tries to
>> solve the issue: http://review.gluster.org/#/c/11935/. It is under
>> review and might need some more work, but I guess it is worth trying.
>>
>> If you're interested, you can try it out and let me know whether it
>> solves the issue or not. What the patch does is treat PING packets as
>> the highest-priority packets and add them to the beginning of the ioq
>> list (the list of packets to be sent over the wire).
>>
>> I might have missed some important points from the long mail ;). I'm
>> sorry, I was too lazy to read it completely :).
>>
>> Regards
>> Rafi KC
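To make the idea from that older mail a little more concrete, the effect
of the patch is roughly the following. This is a simplified, standalone
sketch, not the actual gluster rpc/ioq code; the structure and names here
are made up for illustration.

/* Simplified illustration of the queueing idea behind the patch above:
 * outgoing messages normally go to the tail of the transmit queue, but
 * ping messages jump to the head so they reach the wire ahead of any
 * backlog of data messages. Not the actual gluster ioq code. */
#include <stdio.h>

struct msg {
    const char *payload;
    int is_ping;               /* 1 for a ping/keepalive message */
    struct msg *next;
};

struct out_queue {
    struct msg *head;
    struct msg *tail;
};

/* Normal messages are appended; pings are pushed to the front. */
static void enqueue(struct out_queue *q, struct msg *m)
{
    m->next = NULL;
    if (m->is_ping && q->head) {
        m->next = q->head;     /* jump ahead of the queued backlog */
        q->head = m;
    } else if (q->tail) {
        q->tail->next = m;
        q->tail = m;
    } else {
        q->head = q->tail = m;
    }
}

static struct msg *dequeue(struct out_queue *q)
{
    struct msg *m = q->head;

    if (m) {
        q->head = m->next;
        if (!q->head)
            q->tail = NULL;
    }
    return m;
}

int main(void)
{
    struct out_queue q = { NULL, NULL };
    struct msg data1 = { "WRITE #1", 0, NULL };
    struct msg data2 = { "WRITE #2", 0, NULL };
    struct msg ping  = { "PING",     1, NULL };
    struct msg *m;

    enqueue(&q, &data1);
    enqueue(&q, &data2);
    enqueue(&q, &ping);        /* arrives last, but is sent first */

    while ((m = dequeue(&q)))
        printf("send: %s\n", m->payload);
    return 0;
}

The point is only the ordering: a ping that arrives while a large write
backlog is queued no longer waits behind that backlog, so the ping
exchange is less likely to time out just because a large amount of data
is already queued.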
>>> * Does my description of what we're dealing with suggest that we're just
>>> missing something obvious? I jokingly asked the team whether they had
>>> remembered to run glusterd with the --make-it-fast flag, but sometimes
>>> there are solutions almost like that that we just overlook.
>>>
>>> For what it's worth, we're running gluster 3.8 on CentOS 7 in EC2. We
>>> see the problem most strongly when using general purpose (gp2) EBS
>>> volumes on higher-performance but non-EBS-optimized instances, where
>>> it's pretty easy to overload the disk with traffic over the network. We
>>> can mostly mitigate this by using provisioned IOPS volumes or
>>> EBS-optimized instances, or by running slower instances where the disk
>>> outperforms what we can throw at it over the network. Yet at our scale,
>>> switching to EBS optimization would cost hundreds of thousands of
>>> dollars a year, and running slower instances has obvious drawbacks. In
>>> the absence of a "real" solution, we will probably end up trying to
>>> modify our software to throttle writes to disk, but having to modify our
>>> software to keep from flooding the file system seems like a really sad
>>> thing to have to do.
>>>
>>> Thanks in advance for any pointers!
>>>
>>> --Jay
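One final note on the write-throttling idea at the end of the quoted mail
(and on the fsync()-per-megabyte fuse layer mentioned at the top of this
thread): in its simplest form, the user-space version looks something like
the sketch below. This is only an illustration of the pattern; the 1 MiB
interval is the figure from the mail, and the file name and buffer size
are placeholders. The vm.dirty_bytes / vm.dirty_background_bytes settings
at the top of the thread achieve much the same effect, but in the kernel
and for every writer on the node.

/* Illustration only: fsync() after every fixed amount of data written,
 * so dirty pages never pile up far beyond what the disk has already
 * absorbed. The 1 MiB interval matches the figure mentioned for the old
 * fuse layer; everything else here is a placeholder. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FSYNC_INTERVAL (1024 * 1024)   /* flush after each 1 MiB written */

static ssize_t throttled_write(int fd, const char *buf, size_t len)
{
    size_t written = 0;
    size_t since_sync = 0;

    while (written < len) {
        size_t chunk = len - written;

        if (chunk > FSYNC_INTERVAL - since_sync)
            chunk = FSYNC_INTERVAL - since_sync;

        ssize_t n = write(fd, buf + written, chunk);
        if (n < 0)
            return -1;

        written += n;
        since_sync += n;
        if (since_sync >= FSYNC_INTERVAL) {
            if (fsync(fd) < 0)         /* wait for the disk to catch up */
                return -1;
            since_sync = 0;
        }
    }
    return (ssize_t)written;
}

int main(void)
{
    static char data[4 * 1024 * 1024];  /* 4 MiB of dummy data */
    int fd;

    memset(data, 'x', sizeof(data));
    fd = open("/tmp/throttle-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    if (throttled_write(fd, data, sizeof(data)) < 0)
        return 1;
    close(fd);
    return 0;
}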