On 08/02/2016 08:22 PM, Jay Berkenbilt wrote:
> So we managed to work around the behavior by setting
>
> sysctl -w vm.dirty_bytes=50000000
> sysctl -w vm.dirty_background_bytes=25000000
>
> In our environment with our specific load testing, this prevents the
> disk flush from taking longer than gluster's timeout and avoids the
> whole problem with gluster timing out. We haven't finished our
> performance testing, but initial results suggest that it is no worse
> than the performance we had with our previous home-grown solution. In
> our previous home-grown solution, we had a fuse layer that was calling
> fsync() on every megabyte written as soon as there were 10 megabytes'
> worth of requests in the queue, which was effectively emulating in user
> code what these kernel parameters do, but with even smaller numbers.

Great to know you managed to work around it.

> Thanks for the note below about the potential patch. I applied this to
> 3.8.1 with the fix based on the code review comment and have that in my
> back pocket in case we need it, but we're going to try with just the
> kernel tuning for now. These parameters are decent for us anyway
> because, for other reasons based on the nature of our application and
> certain customer requirements, we want to keep the amount of dirty data
> really low.
>
> It looks like the code review has been idle for some time. Any reason?
> It looks like a simple and relatively obvious change (not to take
> anything away from it at all, and I really appreciate the pointer). Is
> there anything potentially unsafe about it? For example, are there cases
> where not always appending to the queue could cause damage to data if
> the test wasn't exactly right or wasn't doing exactly what it was
> expecting? If I were to run our load test against the patch, it wouldn't
> catch anything like that, because we don't actually look at the content
> of the data written in our load test.

I believe it will not result in any kind of data loss scenario, but we
can have more discussion and review in this area. I got reviews from
Shyam and will continue the discussion in Gerrit as a patch comment. We
would be very happy to have your valuable suggestions on the patch or on
any other solution.

> In any case, if the kernel tuning
> doesn't completely solve the problem for us, I may pull this out and do
> some more rigorous testing against it. If I do, I can comment on the
> code change.

Great. I will rebase the patch so that, if needed, you can cleanly apply
it to the master code.

> For now, unless I post otherwise, we're considering our specific problem
> to be resolved, though I believe there remains a potential weakness in
> gluster's ability to report that it is still up when one of the nodes
> has a slower disk write speed.
>
> --Jay
>
> On 08/01/2016 01:29 AM, Mohammed Rafi K C wrote:
>> On 07/30/2016 10:53 PM, Jay Berkenbilt wrote:
>>> We're using glusterfs in Amazon EC2 and observing certain behavior
>>> involving EBS volumes. The basic situation is that, in some cases,
>>> clients can write data to the file system at a rate such that the
>>> gluster daemon on one or more of the nodes may block in disk wait for
>>> longer than 42 seconds, causing gluster to decide that the brick is
>>> down. In fact, it's not down; it's just slow. I believe it is possible,
>>> by looking at certain system data on the node with the drive, to tell
>>> the difference between being down and working through its queue.
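(A quick aside on the "certain system data" mentioned just above: the size
of the unflushed backlog on the brick node is visible in /proc/meminfo as
the Dirty and Writeback counters, the same numbers sysstat reads. The
sketch below is only an illustration, not code from this thread; the
200 MB threshold is an arbitrary example value.)

/* Illustration only: read the Dirty and Writeback counters from
 * /proc/meminfo to estimate how much data is still waiting to be
 * flushed to disk. A brick that is "slow, not down" will show a large
 * backlog here while still answering on the network. */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *key)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), "%ld", &kb);
            break;
        }
    }
    fclose(f);
    return kb;
}

int main(void)
{
    long dirty = meminfo_kb("Dirty:");
    long writeback = meminfo_kb("Writeback:");

    if (dirty < 0 || writeback < 0)
        return 1;
    printf("Dirty: %ld kB, Writeback: %ld kB\n", dirty, writeback);
    if (dirty + writeback > 200 * 1024)   /* example threshold: 200 MB */
        printf("Large flush backlog: node is busy, not necessarily down.\n");
    return 0;
}

A node that is merely behind will show a large but draining backlog here
while it keeps responding on the network; a node that is actually down
will simply stop responding.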
>>>
>>> We are attempting a two-pronged approach to solving this problem:
>>>
>>> 1. We would like to figure out how to tune the system, including either
>>> or both of adjusting kernel parameters or glusterd, to try to avoid
>>> getting the system into the state of having so much data to flush out to
>>> disk that it blocks in disk wait for such a long time.
>>> 2. We would like to see if we can make gluster more intelligent about
>>> responding to the pings so that the client side is still getting a
>>> response when the remote side is just behind and not down. Though I do
>>> understand that, in some high-performance environments, one may want to
>>> consider a disk that's not keeping up to have failed, so this may have
>>> to be a tunable parameter.
>>>
>>> We have a small team that has been working on this problem for a couple
>>> of weeks. I just joined the team on Friday. I am new to gluster, but I
>>> am not at all new to low-level system programming, Linux administration,
>>> etc. I'm very much open to the possibility of digging into the gluster
>>> code and supplying patches
>> Welcome to Gluster. It is great to see a lot of ideas within days :).
>>
>>> if we can find a way to adjust the behavior
>>> of gluster to make it behave better under these conditions.
>>>
>>> So, here are my questions:
>>>
>>> * Does anyone have experience with this type of issue who can offer any
>>> suggestions on kernel parameters or gluster configurations we could play
>>> with? We have several kernel parameters in mind and are starting to
>>> measure their effect.
>>> * Does anyone have any background on how we might be able to tell that
>>> the system is getting itself into this state? Again, we have some ideas
>>> on this already, mostly by using sysstat to monitor stuff, though
>>> ultimately, if we find a reliable way to do it, we'd probably code it
>>> directly by looking at the relevant stuff in /proc from our own code. I
>>> don't have the details with me right now.
>>> * Can someone provide any pointers to where in the gluster code the ping
>>> logic is handled and/or how one might go about making it a little smarter?
>> One user had a similar problem where ping packets were queued on the
>> waiting list because of heavy traffic. I have a patch that tries to
>> solve the issue: http://review.gluster.org/#/c/11935/. It is under
>> review and might need some more work, but I guess it is worth trying.
>>
>> If you're interested, you can try it out and let me know whether it
>> solves the issue or not. What the patch does is treat PING packets as
>> the highest-priority packets and add them to the beginning of the ioq
>> list (the list of packets to be sent over the wire).
>>
>> I might have missed some important points from the long mail ;). I'm
>> sorry, I was too lazy to read it completely :).
>>
>> Regards
>> Rafi KC
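To make the idea from that older mail a little more concrete, the effect
of the patch is roughly the following. This is a simplified, standalone
sketch, not the actual gluster rpc/ioq code; the structure and names here
are made up for illustration.

/* Simplified illustration of the queueing idea behind the patch above:
 * outgoing messages normally go to the tail of the transmit queue, but
 * ping messages jump to the head so they reach the wire ahead of any
 * backlog of data messages. Not the actual gluster ioq code. */
#include <stdio.h>

struct msg {
    const char *payload;
    int is_ping;               /* 1 for a ping/keepalive message */
    struct msg *next;
};

struct out_queue {
    struct msg *head;
    struct msg *tail;
};

/* Normal messages are appended; pings are pushed to the front. */
static void enqueue(struct out_queue *q, struct msg *m)
{
    m->next = NULL;
    if (m->is_ping && q->head) {
        m->next = q->head;     /* jump ahead of the queued backlog */
        q->head = m;
    } else if (q->tail) {
        q->tail->next = m;
        q->tail = m;
    } else {
        q->head = q->tail = m;
    }
}

static struct msg *dequeue(struct out_queue *q)
{
    struct msg *m = q->head;

    if (m) {
        q->head = m->next;
        if (!q->head)
            q->tail = NULL;
    }
    return m;
}

int main(void)
{
    struct out_queue q = { NULL, NULL };
    struct msg data1 = { "WRITE #1", 0, NULL };
    struct msg data2 = { "WRITE #2", 0, NULL };
    struct msg ping  = { "PING",     1, NULL };
    struct msg *m;

    enqueue(&q, &data1);
    enqueue(&q, &data2);
    enqueue(&q, &ping);        /* arrives last, but is sent first */

    while ((m = dequeue(&q)))
        printf("send: %s\n", m->payload);
    return 0;
}

The point is only the ordering: a ping that arrives while a large write
backlog is queued no longer waits behind that backlog, so the ping
exchange is less likely to time out just because a large amount of data
is already queued.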
>>> * Does my description of what we're dealing with suggest that we're just
>>> missing something obvious? I jokingly asked the team whether they had
>>> remembered to run glusterd with the --make-it-fast flag, but sometimes
>>> there are solutions almost like that that we just overlook.
>>>
>>> For what it's worth, we're running gluster 3.8 on CentOS 7 in EC2. We
>>> see the problem most strongly when using general purpose (gp2) EBS
>>> volumes on higher-performance but non-EBS-optimized instances, where
>>> it's pretty easy to overload the disk with traffic over the network. We
>>> can mostly mitigate this by using provisioned IOPS volumes or
>>> EBS-optimized instances, or by running slower instances where the disk
>>> outperforms what we can throw at it over the network. Yet at our scale,
>>> switching to EBS optimization would cost hundreds of thousands of
>>> dollars a year, and running slower instances has obvious drawbacks. In
>>> the absence of a "real" solution, we will probably end up trying to
>>> modify our software to throttle writes to disk, but having to modify our
>>> software to keep from flooding the file system seems like a really sad
>>> thing to have to do.
>>>
>>> Thanks in advance for any pointers!
>>>
>>> --Jay
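One final note on the write-throttling idea at the end of the quoted mail
(and on the fsync()-per-megabyte fuse layer mentioned at the top of this
thread): in its simplest form, the user-space version looks something like
the sketch below. This is only an illustration of the pattern; the 1 MiB
interval is the figure from the mail, and the file name and buffer size
are placeholders. The vm.dirty_bytes / vm.dirty_background_bytes settings
at the top of the thread achieve much the same effect, but in the kernel
and for every writer on the node.

/* Illustration only: fsync() after every fixed amount of data written,
 * so dirty pages never pile up far beyond what the disk has already
 * absorbed. The 1 MiB interval matches the figure mentioned for the old
 * fuse layer; everything else here is a placeholder. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FSYNC_INTERVAL (1024 * 1024)   /* flush after each 1 MiB written */

static ssize_t throttled_write(int fd, const char *buf, size_t len)
{
    size_t written = 0;
    size_t since_sync = 0;

    while (written < len) {
        size_t chunk = len - written;

        if (chunk > FSYNC_INTERVAL - since_sync)
            chunk = FSYNC_INTERVAL - since_sync;

        ssize_t n = write(fd, buf + written, chunk);
        if (n < 0)
            return -1;

        written += n;
        since_sync += n;
        if (since_sync >= FSYNC_INTERVAL) {
            if (fsync(fd) < 0)         /* wait for the disk to catch up */
                return -1;
            since_sync = 0;
        }
    }
    return (ssize_t)written;
}

int main(void)
{
    static char data[4 * 1024 * 1024];  /* 4 MiB of dummy data */
    int fd;

    memset(data, 'x', sizeof(data));
    fd = open("/tmp/throttle-demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    if (throttled_write(fd, data, sizeof(data)) < 0)
        return 1;
    close(fd);
    return 0;
}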