> smartctl can very much read sas drives so I would look into that chain
> first.

I have smartd running and it does recognize the sas drives; however, I have collectd grabbing smart data and I am getting nothing from it. This is everything I am getting from a sata drive:

# SELECT * FROM "smart_value" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_value
time                           host instance type              value
----                           ---- -------- ----              -----
2022-10-14T13:24:04.029043881Z c01  sdb      smart_poweron     118652400
2022-10-14T13:24:04.043975567Z c01  sdb      smart_powercycles 8
2022-10-14T13:24:04.05828545Z  c01  sdb      smart_badsectors  0
2022-10-14T13:24:04.07207858Z  c01  sdb      smart_temperature 30

> SELECT * FROM "smart_pretty" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_pretty
time                           host instance type            type_instance            value
----                           ---- -------- ----            -------------            -----
2022-10-14T13:24:04.072900793Z c01  sdb      smart_attribute raw-read-error-rate      0
2022-10-14T13:24:04.073731474Z c01  sdb      smart_attribute spin-up-time             5383
2022-10-14T13:24:04.074562994Z c01  sdb      smart_attribute start-stop-count         8
2022-10-14T13:24:04.075397312Z c01  sdb      smart_attribute reallocated-sector-count 0
2022-10-14T13:24:04.07624241Z  c01  sdb      smart_attribute seek-error-rate          0
2022-10-14T13:24:04.077058461Z c01  sdb      smart_attribute power-on-hours           118652400000
2022-10-14T13:24:04.077886085Z c01  sdb      smart_attribute spin-retry-count         0
2022-10-14T13:24:04.078708091Z c01  sdb      smart_attribute calibration-retry-count  0
2022-10-14T13:24:04.079542614Z c01  sdb      smart_attribute power-cycle-count        8
2022-10-14T13:24:04.080374422Z c01  sdb      smart_attribute power-off-retract-count  6
2022-10-14T13:24:04.0812049Z   c01  sdb      smart_attribute load-cycle-count         74
2022-10-14T13:24:04.082027399Z c01  sdb      smart_attribute temperature-celsius-2    303150
2022-10-14T13:24:04.082879593Z c01  sdb      smart_attribute reallocated-event-count  0
2022-10-14T13:24:04.083707815Z c01  sdb      smart_attribute current-pending-sector   0
2022-10-14T13:24:04.084536779Z c01  sdb      smart_attribute offline-uncorrectable    0
2022-10-14T13:24:04.085365242Z c01  sdb      smart_attribute udma-crc-error-count     0
2022-10-14T13:24:04.086191201Z c01  sdb      smart_attribute multi-zone-error-rate    0

> Are they behind a raid controller that is masking the smart
> commands?

No.

> As for monitoring, we run the smartd service to keep an eye on drives.
> More often than not I notice weird things with ceph long before smart
> throws an actual error. Bouncing drives, oddly high latency on our "Max
> OSD Apply Latency" graph.

Do you only grab one metric in the query, or do you also 'calculate' whether the disk is currently in use and compensate for that in the reported latency? (Or does this metric not depend on current use?) What values should I look for, how many hundreds of ms?

I have 106 metrics listed in ceph_latency. These start with osd; which one would be the apply latency?

Osd.opBeforeDequeueOpLat
Osd.opBeforeQueueOpLat
Osd.opLatency
Osd.opPrepareLatency
Osd.opProcessLatency
Osd.opRLatency
Osd.opRPrepareLatency
Osd.opRProcessLatency
Osd.opRwLatency
Osd.opRwPrepareLatency
Osd.opRwProcessLatency
Osd.opWLatency
Osd.opWPrepareLatency
Osd.opWProcessLatency
Osd.subopLatency
Osd.subopWLatency

> Every few months I throw a smart long test
> at the whole cluster and a few days later go back and rake the results.
> Anything that has a failure gets immediately removed from ceph by me
> regardless if smart says it's fine or not. At least 90% of the drives
> we RMA have smart passed but failures in the read test. Never had
> pushback from WDC or Seagate on it.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
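P.S. The "long test, then rake the results" workflow quoted above could be sketched roughly like this in shell. This is only a sketch, not the poster's actual script: the device glob is hypothetical and will differ per host, and the log parsing assumes smartmontools' plain-text self-test log layout (entries start with "# <num>", clean runs read "Completed without error"); verify against your smartctl version before relying on it.

```shell
#!/bin/sh
# Kick off a long self-test on one drive (run once per device, e.g.):
#   smartctl -t long /dev/sdX
# ...then come back a few days later and rake the logs.

# Print self-test log entries that did not complete cleanly.
# Reads `smartctl -l selftest` output on stdin; exits 0 only if
# at least one non-clean entry was found (grep semantics).
check_selftest_log() {
    grep -E '^# ?[0-9]+' | grep -v 'Completed without error'
}

# Hypothetical raking loop (device list will differ per host):
#   for d in /dev/sd?; do
#       if smartctl -l selftest "$d" | check_selftest_log >/dev/null; then
#           echo "$d: self-test failures, pull it from ceph"
#       fi
#   done
```

This matches the quoted policy: any entry that is not a clean completion (e.g. "Completed: read failure") flags the drive, regardless of the overall SMART pass/fail verdict.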