Re: How you handle failing/slow disks?

Alex Litvak <alexander.v.litvak@xxxxxxxxx> · Thu, 22 Nov 2018 15:13:19 -0600

Sorry for hijacking the thread, but do you have any idea what to watch for here:

I monitor the admin sockets of my OSDs, and occasionally I see both op_w_process_latency and op_w_latency burst to nearly 150 - 200 ms on 7200 RPM enterprise SAS drives.
For example, the load average on the node jumps while the CPU is 97% idle, and out of 12 OSDs some have an op_w_latency of 170 - 180 ms, 3 more have ~120 - 130 ms, and the rest are at
100 ms or below.  Does this say anything about a possible drive failure?  (I am running the drives inside a Dell PowerVault MD3400, and the storage unit shows them all green/OK.)  Unfortunately, smartmon
run from outside the box tells me nothing other than that health is OK.
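
For anyone who wants to reproduce this kind of monitoring, here is a minimal sketch of how such bursts can be measured. It assumes the counters come from `ceph daemon osd.N perf dump` on the OSD host and that each latency counter carries an `avgcount` (number of ops) and a `sum` (total seconds). The function names, the 10 second period, and the choice of counters are my own assumptions for illustration, not anything Ceph ships. Because the counters are cumulative, the sketch diffs two samples to get the latency for just that interval:

```python
import json
import subprocess
import time


def perf_dump(osd_id):
    """Return the parsed `perf dump` of one OSD via its admin socket."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"]
    )
    return json.loads(out)


def interval_latency_ms(before, after, counter="op_w_latency"):
    """Average latency in ms for ops completed between two samples.

    Each latency counter is assumed to look like
    {"avgcount": <number of ops>, "sum": <total seconds>}.
    Returns None when no ops completed in the interval.
    """
    b = before["osd"][counter]
    a = after["osd"][counter]
    ops = a["avgcount"] - b["avgcount"]
    if ops <= 0:
        return None  # idle interval, or the daemon restarted and reset counters
    return (a["sum"] - b["sum"]) / ops * 1000.0


def watch(osd_id, period_s=10):
    """Print interval write latencies for one OSD until interrupted."""
    prev = perf_dump(osd_id)
    while True:
        time.sleep(period_s)
        cur = perf_dump(osd_id)
        for name in ("op_w_latency", "op_w_process_latency"):
            print(osd_id, name, interval_latency_ms(prev, cur, name))
        prev = cur
```

Calling `watch(3)` on the node that hosts osd.3 would print one line per counter every period. A short period is what makes a burst visible at all; the lifetime average barely moves.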

High load usually coincides with op_w_latency being elevated on multiple OSDs (4 or more) at the same time.
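
To make the "several OSDs at once" pattern something you can alert on, a rough sketch follows. The 100 ms threshold and the minimum of 4 OSDs are taken from the numbers above; the function names and the sample values in the usage note are made up for illustration. It only counts how many OSDs on a node are slow at the same moment and does not by itself say what the cause is:

```python
def slow_osds(latency_ms_by_osd, threshold_ms=100.0):
    """Return the sorted ids of OSDs whose latency exceeds the threshold.

    latency_ms_by_osd maps an OSD id to its latency in ms for the
    current sampling interval, or None if the OSD was idle.
    """
    return sorted(
        osd
        for osd, ms in latency_ms_by_osd.items()
        if ms is not None and ms > threshold_ms
    )


def is_node_wide_burst(latency_ms_by_osd, threshold_ms=100.0, min_osds=4):
    """True when at least min_osds OSDs are slow in the same interval."""
    return len(slow_osds(latency_ms_by_osd, threshold_ms)) >= min_osds
```

With hypothetical readings such as `{0: 175.0, 1: 178.0, 2: 125.0, 3: 122.0, 4: 128.0, 5: 90.0}`, `slow_osds` returns `[0, 1, 2, 3, 4]` and `is_node_wide_burst` returns `True`. Logging the result next to the load average makes it easier to check whether the two really move together.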

On 11/21/2018 10:26 AM, Paul Emmerich wrote:
Yeah, we also observed problems with HP RAID controllers misbehaving
when a single disk starts to fail. We would never recommend building a
Ceph cluster on HP RAID controllers until they can fix that issue.

There are several features in Ceph which detect dead disks: there are
timeouts for OSDs checking each other and there's a timeout for OSDs
checking in with the mons. But that's usually not enough in this
scenario. The good news is that recent Ceph versions will show which
OSDs are implicated in slow requests (check ceph health detail) which
at least gives you some way to figure out which OSDs are becoming
slow.

We have found it to be useful to monitor the op_*_latency values of
all OSDs (especially subop latencies) from the admin daemon to detect
such failures earlier.
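
A minimal sketch of what that monitoring could look like, assuming the `perf dump` JSON of each OSD has already been collected into a dict keyed by OSD id, and that latency counters have the form `{"avgcount": ..., "sum": ...}` with `sum` in seconds. The function names and the `*op_*latency` glob (meant to catch both op and subop counters, e.g. `op_w_latency` and `subop_w_latency`) are assumptions for illustration. Note that these are averages since daemon start, so a production setup would more likely diff successive samples:

```python
from fnmatch import fnmatch


def latency_averages_ms(perf, pattern="*op_*latency"):
    """Return {counter name: average latency in ms} for one OSD.

    Only counters under the "osd" section that match the glob and
    have recorded at least one op are included.
    """
    result = {}
    for name, value in perf.get("osd", {}).items():
        if not fnmatch(name, pattern) or not isinstance(value, dict):
            continue
        count = value.get("avgcount", 0)
        if count > 0:
            result[name] = value["sum"] / count * 1000.0
    return result


def worst_osds(perf_by_osd, counter="subop_w_latency", top=3):
    """Rank OSDs by one latency counter, slowest first."""
    ranked = []
    for osd_id, perf in perf_by_osd.items():
        ms = latency_averages_ms(perf).get(counter)
        if ms is not None:
            ranked.append((osd_id, ms))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Comparing each OSD against its peers, rather than against a fixed limit, is one way to spot a single disk that is drifting away from the rest.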

Paul

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com