Hello,
I run a Ceph Nautilus 14.2.22 cluster with 144 OSDs. To be able to see
whether a disk has hardware trouble and might fail soon, I activated
device health management. The cluster is running on Ubuntu 18.04, so the
first task was to install a newer smartctl version; I used smartctl 7.0.
Device monitoring is activated (ceph device monitoring on). Using ceph
device get-health-metrics <device ID> I can see the results of the
smartctl runs for the device with the given ID, like this:
....
"product": "ST4000NM0295",
"revision": "DT31",
"rotation_rate": 7200,
"scsi_error_counter_log": {
"read": {
"correction_algorithm_invocations": 20,
"errors_corrected_by_eccdelayed": 20,
"errors_corrected_by_eccfast": 3457558131,
....
So this seems to run just fine. For failure prediction I selected the
"local" method (ceph config set global device_failure_prediction_mode
local).
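To summarize, the relevant commands I used were roughly these (with
<device ID> as a placeholder):

  ceph device monitoring on
  ceph config set global device_failure_prediction_mode local
  ceph device get-health-metrics <device ID>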
What's missing for me is the prediction output in ceph device ls: the
column "LIFE EXPECTANCY" is always empty and I have no idea why:
# ceph device ls
DEVICE                         HOST:DEV   DAEMONS  LIFE EXPECTANCY
SEAGATE_ST4000NM017A_WS23WKJ4  ceph4:sdb  osd.49
SEAGATE_ST4000NM0295_ZC13XK9P  ceph6:sdo  osd.92
SEAGATE_ST4000NM0295_ZC141B3S  ceph6:sdj  osd.89
....
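If I understand the documentation correctly, it should also be possible
to ask the predictor about a single disk directly, for example:

  ceph device predict-life-expectancy SEAGATE_ST4000NM0295_ZC13XK9P

but I am not sure whether that would return anything as long as the
LIFE EXPECTANCY column stays empty.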
Does anyone have an idea what might be missing in my setup? Is the "LIFE
EXPECTANCY" column perhaps only populated if the local predictor predicts
a failure, or should I see something like "good" there if the disk is ok
for the moment? Recently I even had a disk that died, but I did not see
anything in ceph device ls for that OSD's disk. So I am really unsure
whether failure prediction is working at all on my Ceph system.
Thanks
Rainer
--
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html