Failed Disk simulation question

Alex Litvak <alexander.v.litvak@xxxxxxxxx> · Tue, 21 May 2019 15:15:02 -0500

Hello cephers,

I know that a similar question was posted 5 years ago. However, the answer was inconclusive for me.
I installed a new Nautilus 14.2.1 cluster and started pre-production testing. I followed the Red Hat documentation and simulated a soft disk failure with:

#  echo 1 > /sys/block/sdc/device/delete

The cluster was idle at the time, being new. I noticed some disk-related errors in dmesg, but that was about it.
As far as I could tell, the failure went undetected for the next 20-30 minutes: all OSDs were up and in, and health was OK. The OSD logs had no smoking gun either.
After 30 minutes, I restarted the OSD container and, as expected, it failed to start.
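
For anyone who wants to reproduce this, here is a minimal sketch of how the kernel's view of the disk can be checked next to Ceph's view. The helper names disk_present and disk_state and the SYSFS_ROOT variable are hypothetical, added only for illustration. sdc is the device from my test.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ceph): report whether the kernel still
# exposes a block device in sysfs after the "delete" write shown above.
# SYSFS_ROOT is an assumption so the check can be pointed at a test tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

disk_present() {
    # $1 is the kernel device name, e.g. sdc
    [ -e "$SYSFS_ROOT/block/$1" ]
}

disk_state() {
    # Print a one-line summary that can be logged alongside cluster status.
    if disk_present "$1"; then
        echo "$1: present"
    else
        echo "$1: missing"
    fi
}
```

Running disk_state sdc and then "ceph health detail" and "ceph osd tree" shows the mismatch I am asking about: the kernel reports the device as gone while the cluster still lists the OSD as up and in.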

Later on, I performed the same operation during a fio benchmark, and the OSD failed immediately.

My question is: should the disk problem have been detected quickly even on an idle cluster? I thought Nautilus had the means to sense a failure before intensive I/O hits the disk.
Am I wrong to expect that?
