Hi,

Good points; however, given that Ceph already collects all these statistics, isn't there some way to set reasonable thresholds and actually have Ceph detect the number of read errors and suggest that a given drive should be replaced?

It seems a bit strange that we should all have to wait for a PG read error, then log into each node to check the read-error count for each device and keep track of it ourselves. Of course it's possible to write scripts for everything (a rough sketch follows below the quoted mail), but there must be numerous Ceph sites with hundreds of OSD nodes, so I'm a bit surprised this isn't more automated...

Cheers,

Erik

--
Erik Lindahl <erik.lindahl@xxxxxxxxx>

On 10 Jan 2023 at 00:09 +0100, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > On Jan 9, 2023, at 17:46, David Orman <ormandj@xxxxxxxxxxxx> wrote:
> >
> > It's important to note that we do not suggest treating the SMART "OK" indicator as a sign that the drive is healthy. We monitor correctable/uncorrectable error counts, since you can see a dramatic rise in them when a drive starts to fail. SMART health will report "OK" long after the drive is throwing many uncorrectable errors and needs replacement. You have to look at the actual counters themselves.
>
> I strongly agree, especially given personal experience with SSD firmware design flaws.
>
> Also, examining UDMA/CRC error rates led to the discovery that certain aftermarket drive carriers had lower tolerances than those from the chassis vendor, resulting in drives that were silently slow. Reseating them in most cases restored performance.
>
> — aad
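For anyone who wants to script this in the meantime, here is roughly what such a check could look like. This is only a minimal sketch, not anything shipped by Ceph: it assumes smartmontools 7+ (for JSON output via smartctl -j) and ATA drives, and the watched attribute IDs and thresholds below are illustrative guesses that would need tuning per site and drive model.

#!/usr/bin/env python3
# Sketch: flag drives whose SMART error counters exceed a threshold,
# rather than trusting the overall "PASSED"/"OK" status.
# Assumes smartmontools 7+ (JSON output) and ATA drives; thresholds
# are arbitrary examples, not recommendations from this thread.

import json
import subprocess
import sys

# SMART attribute IDs worth watching; raw values should stay near zero.
WATCHED = {
    5:   ("Reallocated_Sector_Ct", 10),
    187: ("Reported_Uncorrect", 1),
    197: ("Current_Pending_Sector", 1),
    198: ("Offline_Uncorrectable", 1),
    199: ("UDMA_CRC_Error_Count", 10),  # also catches bad carriers/cables
}

def check(dev: str) -> bool:
    # smartctl may exit non-zero even when it prints valid JSON,
    # so don't pass check=True here.
    out = subprocess.run(["smartctl", "-A", "-j", dev],
                         capture_output=True, text=True)
    data = json.loads(out.stdout)
    ok = True
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        entry = WATCHED.get(attr["id"])
        if entry is None:
            continue
        name, limit = entry
        raw = attr["raw"]["value"]
        if raw >= limit:
            print(f"{dev}: {name} raw value {raw} >= {limit}; "
                  f"consider replacing this drive")
            ok = False
    return ok

if __name__ == "__main__":
    # Check every device before deciding the exit status.
    results = [check(dev) for dev in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)

Something like this could run from cron on each OSD node (e.g. ./smart_check.py /dev/sd?) with a non-zero exit wired into existing alerting; it is no substitute for proper cluster-wide automation, which is exactly the point above.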