OSDs stalling on Intel SSDs

Hi everybody,

I have a situation that occurs under moderate I/O load on Ceph Luminous:

2018-07-10 10:27:01.257916 mon.node4 mon.0 172.16.0.4:6789/0 15590 : cluster [INF] mon.node4 is new leader, mons node4,node5,node6,node7,node8 in quorum (ranks 0,1,2,3,4)
2018-07-10 10:27:01.306329 mon.node4 mon.0 172.16.0.4:6789/0 15595 : cluster [INF] Health check cleared: MON_DOWN (was: 1/5 mons down, quorum node4,node6,node7,node8)
2018-07-10 10:27:01.386124 mon.node4 mon.0 172.16.0.4:6789/0 15596 : cluster [WRN] overall HEALTH_WARN 1 osds down; Reduced data availability: 1 pg peering; Degraded data redundancy: 58774/10188798 objects degraded (0.577%), 13 pgs degraded; 412 slow requests are blocked > 32 sec
2018-07-10 10:27:02.598175 mon.node4 mon.0 172.16.0.4:6789/0 15597 : cluster [WRN] Health check update: Degraded data redundancy: 77153/10188798 objects degraded (0.757%), 17 pgs degraded (PG_DEGRADED)
2018-07-10 10:27:02.598225 mon.node4 mon.0 172.16.0.4:6789/0 15598 : cluster [WRN] Health check update: 381 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-10 10:27:02.598264 mon.node4 mon.0 172.16.0.4:6789/0 15599 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2018-07-10 10:27:02.608006 mon.node4 mon.0 172.16.0.4:6789/0 15600 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-07-10 10:27:02.701029 mon.node4 mon.0 172.16.0.4:6789/0 15601 : cluster [INF] osd.36 172.16.0.5:6800/3087 boot
2018-07-10 10:27:01.184334 osd.36 osd.36 172.16.0.5:6800/3087 23 : cluster [WRN] Monitor daemon marked osd.36 down, but it is still running
2018-07-10 10:27:04.861372 mon.node4 mon.0 172.16.0.4:6789/0 15604 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 381 slow requests are blocked > 32 sec)

The OSDs that seem to be affected are all backed by Intel SSDs; the specific model is SSDSC2BX480G4L.

I have throttled backups to try to reduce the load, but when the problem occurs it seems to hit the same OSDs each time.  It also has the side effect of taking down the mon on the same node for a few seconds and triggering a monitor election.
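
To get a closer look at what is blocking while this happens, my plan is to query the admin socket on the affected OSD.  A rough Python sketch of what I have in mind is below (assuming osd.36 from the log above, that the script runs on the node hosting that OSD, and that the JSON field names match what my Luminous build emits):

    import json
    import subprocess

    OSD_ID = 36  # osd.36 from the log above; adjust as needed

    def admin_socket(cmd):
        # "ceph daemon osd.N <cmd>" must run on the node that hosts the OSD
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.{}".format(OSD_ID), cmd])
        return json.loads(out)

    # Requests currently stuck inside the OSD
    in_flight = admin_socket("dump_ops_in_flight")
    print("ops in flight: {}".format(
        in_flight.get("num_ops", len(in_flight.get("ops", [])))))

    # Recently completed ops, slowest first
    historic = admin_socket("dump_historic_ops")
    ops = sorted(historic.get("ops", []),
                 key=lambda op: op.get("duration", 0), reverse=True)
    for op in ops[:5]:
        print("{:.3f}s  {}".format(op.get("duration", 0),
                                   op.get("description", "?")))

If the slowest ops consistently point at subop writes on the same few disks, that would line up with the drives themselves stalling rather than a network or peering problem.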

I am wondering if this may be a firmware issue with this drive model.  Does anyone have any insight, or suggestions for additional troubleshooting steps I could try to get a deeper look at this behavior?
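
For reference, this is roughly how I have been collecting firmware versions across the drives (a sketch assuming smartmontools is installed and the script runs as root; the device paths are just examples and need to be adjusted per node):

    import subprocess

    # Example device paths; substitute the actual devices behind the affected OSDs
    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

    for dev in DEVICES:
        # "smartctl -i" prints identity info, including model and firmware revision
        info = subprocess.check_output(["smartctl", "-i", dev]).decode()
        for line in info.splitlines():
            if line.startswith(("Device Model:", "Serial Number:",
                                "Firmware Version:")):
                print("{}: {}".format(dev, line.strip()))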

I am going to upgrade the firmware on these drives and see if it helps.

--
Shawn Iverson, CETL
Director of Technology
Rush County Schools
765-932-3901 x1171


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
