> smartctl can very much read sas drives so I would look into that chain
> first.

I have smartd running and it does recognize the sas drives; however, I have collectd grabbing smart data and I am getting nothing from it. This is everything I am getting from a sata drive:

# SELECT * FROM "smart_value" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_value
time                           host instance type              value
----                           ---- -------- ----              -----
2022-10-14T13:24:04.029043881Z c01  sdb      smart_poweron     118652400
2022-10-14T13:24:04.043975567Z c01  sdb      smart_powercycles 8
2022-10-14T13:24:04.05828545Z  c01  sdb      smart_badsectors  0
2022-10-14T13:24:04.07207858Z  c01  sdb      smart_temperature 30

> SELECT * FROM "smart_pretty" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_pretty
time                           host instance type            type_instance            value
----                           ---- -------- ----            -------------            -----
2022-10-14T13:24:04.072900793Z c01  sdb      smart_attribute raw-read-error-rate      0
2022-10-14T13:24:04.073731474Z c01  sdb      smart_attribute spin-up-time             5383
2022-10-14T13:24:04.074562994Z c01  sdb      smart_attribute start-stop-count         8
2022-10-14T13:24:04.075397312Z c01  sdb      smart_attribute reallocated-sector-count 0
2022-10-14T13:24:04.07624241Z  c01  sdb      smart_attribute seek-error-rate          0
2022-10-14T13:24:04.077058461Z c01  sdb      smart_attribute power-on-hours           118652400000
2022-10-14T13:24:04.077886085Z c01  sdb      smart_attribute spin-retry-count         0
2022-10-14T13:24:04.078708091Z c01  sdb      smart_attribute calibration-retry-count  0
2022-10-14T13:24:04.079542614Z c01  sdb      smart_attribute power-cycle-count        8
2022-10-14T13:24:04.080374422Z c01  sdb      smart_attribute power-off-retract-count  6
2022-10-14T13:24:04.0812049Z   c01  sdb      smart_attribute load-cycle-count         74
2022-10-14T13:24:04.082027399Z c01  sdb      smart_attribute temperature-celsius-2    303150
2022-10-14T13:24:04.082879593Z c01  sdb      smart_attribute reallocated-event-count  0
2022-10-14T13:24:04.083707815Z c01  sdb      smart_attribute current-pending-sector   0
2022-10-14T13:24:04.084536779Z c01  sdb      smart_attribute offline-uncorrectable    0
2022-10-14T13:24:04.085365242Z c01  sdb      smart_attribute udma-crc-error-count     0
2022-10-14T13:24:04.086191201Z c01  sdb      smart_attribute multi-zone-error-rate    0

> Are they behind a raid controller that is masking the smart
> commands?

No.

> As for monitoring, we run the smartd service to keep an eye on drives.
> More often than not I notice weird things with ceph long before smart
> throws an actual error. Bouncing drives, oddly high latency on our "Max
> OSD Apply Latency" graph.

Do you only grab one metric in the query, or do you also 'calculate' whether the disk is currently in use and compensate for that in the reported latency? (Or does this metric not depend on current use?) What values should I look for, how many hundreds of ms?

I have 106 metrics listed in ceph_latency. These start with osd; which one would be the apply latency?

Osd.opBeforeDequeueOpLat
Osd.opBeforeQueueOpLat
Osd.opLatency
Osd.opPrepareLatency
Osd.opProcessLatency
Osd.opRLatency
Osd.opRPrepareLatency
Osd.opRProcessLatency
Osd.opRwLatency
Osd.opRwPrepareLatency
Osd.opRwProcessLatency
Osd.opWLatency
Osd.opWPrepareLatency
Osd.opWProcessLatency
Osd.subopLatency
Osd.subopWLatency

> Every few months I throw a smart long test
> at the whole cluster and a few days later go back and rake the results.
> Anything that has a failure gets immediately removed from ceph by me
> regardless if smart says it's fine or not. At least 90% of the drives
> we RMA have smart passed but failures in the read test. Never had
> pushback from WDC or Seagate on it.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
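P.S. The "long test, then rake the results" workflow quoted above could be sketched roughly like this in shell. This is only a sketch, not the poster's actual script: the device glob is hypothetical and will differ per host, and the log parsing assumes smartmontools' plain-text self-test log layout (entries start with "# <num>", clean runs read "Completed without error"); verify against your smartctl version before relying on it.

```shell
#!/bin/sh
# Kick off a long self-test on one drive (run once per device, e.g.):
#   smartctl -t long /dev/sdX
# ...then come back a few days later and rake the logs.

# Print self-test log entries that did not complete cleanly.
# Reads `smartctl -l selftest` output on stdin; exits 0 only if
# at least one non-clean entry was found (grep semantics).
check_selftest_log() {
    grep -E '^# ?[0-9]+' | grep -v 'Completed without error'
}

# Hypothetical raking loop (device list will differ per host):
#   for d in /dev/sd?; do
#       if smartctl -l selftest "$d" | check_selftest_log >/dev/null; then
#           echo "$d: self-test failures, pull it from ceph"
#       fi
#   done
```

This matches the quoted policy: any entry that is not a clean completion (e.g. "Completed: read failure") flags the drive, regardless of the overall SMART pass/fail verdict.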