OSDs stalling on Intel SSDs

Hi everybody,

I have a situation that occurs under moderate I/O load on Ceph Luminous:

2018-07-10 10:27:01.257916 mon.node4 mon.0 172.16.0.4:6789/0 15590 : cluster [INF] mon.node4 is new leader, mons node4,node5,node6,node7,node8 in quorum (ranks 0,1,2,3,4)
2018-07-10 10:27:01.306329 mon.node4 mon.0 172.16.0.4:6789/0 15595 : cluster [INF] Health check cleared: MON_DOWN (was: 1/5 mons down, quorum node4,node6,node7,node8)
2018-07-10 10:27:01.386124 mon.node4 mon.0 172.16.0.4:6789/0 15596 : cluster [WRN] overall HEALTH_WARN 1 osds down; Reduced data availability: 1 pg peering; Degraded data redundancy: 58774/10188798 objects degraded (0.577%), 13 pgs degraded; 412 slow requests are blocked > 32 sec
2018-07-10 10:27:02.598175 mon.node4 mon.0 172.16.0.4:6789/0 15597 : cluster [WRN] Health check update: Degraded data redundancy: 77153/10188798 objects degraded (0.757%), 17 pgs degraded (PG_DEGRADED)
2018-07-10 10:27:02.598225 mon.node4 mon.0 172.16.0.4:6789/0 15598 : cluster [WRN] Health check update: 381 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-10 10:27:02.598264 mon.node4 mon.0 172.16.0.4:6789/0 15599 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2018-07-10 10:27:02.608006 mon.node4 mon.0 172.16.0.4:6789/0 15600 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-07-10 10:27:02.701029 mon.node4 mon.0 172.16.0.4:6789/0 15601 : cluster [INF] osd.36 172.16.0.5:6800/3087 boot
2018-07-10 10:27:01.184334 osd.36 osd.36 172.16.0.5:6800/3087 23 : cluster [WRN] Monitor daemon marked osd.36 down, but it is still running
2018-07-10 10:27:04.861372 mon.node4 mon.0 172.16.0.4:6789/0 15604 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 381 slow requests are blocked > 32 sec)

The OSDs that seem to be affected are all backed by Intel SSDs; the specific model is SSDSC2BX480G4L.

I have throttled backups to try to reduce the load, but when the problem occurs it seems to hit the same OSDs each time.  It also has the side effect of taking down the mon on the same node for a few seconds and triggering a monitor election.
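
To get a closer look at what is blocking while this happens, my plan is to query the admin socket on the affected OSD.  A rough Python sketch of what I have in mind is below (assuming osd.36 from the log above, that the script runs on the node hosting that OSD, and that the JSON field names match what my Luminous build emits):

    import json
    import subprocess

    OSD_ID = 36  # osd.36 from the log above; adjust as needed

    def admin_socket(cmd):
        # "ceph daemon osd.N <cmd>" must run on the node that hosts the OSD
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.{}".format(OSD_ID), cmd])
        return json.loads(out)

    # Requests currently stuck inside the OSD
    in_flight = admin_socket("dump_ops_in_flight")
    print("ops in flight: {}".format(
        in_flight.get("num_ops", len(in_flight.get("ops", [])))))

    # Recently completed ops, slowest first
    historic = admin_socket("dump_historic_ops")
    ops = sorted(historic.get("ops", []),
                 key=lambda op: op.get("duration", 0), reverse=True)
    for op in ops[:5]:
        print("{:.3f}s  {}".format(op.get("duration", 0),
                                   op.get("description", "?")))

If the slowest ops consistently point at subop writes on the same few disks, that would line up with the drives themselves stalling rather than a network or peering problem.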

I am wondering if this may be a firmware issue with this drive model.  Does anyone have any insight, or suggestions for additional troubleshooting steps I could try to get a deeper look at this behavior?
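
For reference, this is roughly how I have been collecting firmware versions across the drives (a sketch assuming smartmontools is installed and the script runs as root; the device paths are just examples and need to be adjusted per node):

    import subprocess

    # Example device paths; substitute the actual devices behind the affected OSDs
    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

    for dev in DEVICES:
        # "smartctl -i" prints identity info, including model and firmware revision
        info = subprocess.check_output(["smartctl", "-i", dev]).decode()
        for line in info.splitlines():
            if line.startswith(("Device Model:", "Serial Number:",
                                "Firmware Version:")):
                print("{}: {}".format(dev, line.strip()))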

I am going to upgrade the firmware on these drives and see if it helps.

--
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 x1171


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
