Re: [ERR] OSD_SCRUB_ERRORS: 2 scrub errors

We ship all of this to our centralized monitoring system (and a lot more) and have dashboards/proactive monitoring/alerting across 100PiB+ of Ceph. If you're running Ceph in production, I believe host-level monitoring is critical, above and beyond the Ceph level. Things like inlet/outlet temperature, the hardware state of various components, and other host-level details are probably best served by monitoring external to Ceph itself.
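
Purely as an illustration of the host-level side (not what we actually run), here's a rough Python sketch that scrapes BMC temperature sensors via ipmitool and prints them for a node-level collector to pick up. The sensor names, the output format it parses, and the availability of `ipmitool sdr type temperature` all vary by vendor, so treat every detail as an assumption:

#!/usr/bin/env python3
# Hypothetical sketch: scrape BMC temperature sensors with ipmitool and
# print them as simple "name value" pairs for a host-level collector.
# Assumes ipmitool is installed; sensor names and line format vary by BMC.
import re
import subprocess

def read_bmc_temperatures():
    """Return {sensor_name: celsius} parsed from `ipmitool sdr type temperature`."""
    out = subprocess.run(
        ["ipmitool", "sdr", "type", "temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    temps = {}
    for line in out.splitlines():
        # Typical line: "Inlet Temp | 04h | ok | 7.1 | 24 degrees C"
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue
        match = re.search(r"(-?\d+(\.\d+)?)\s*degrees C", fields[4])
        if match:
            temps[fields[0]] = float(match.group(1))
    return temps

if __name__ == "__main__":
    for name, celsius in read_bmc_temperatures().items():
        print(f"{name.replace(' ', '_').lower()} {celsius}")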

I took a quick glance and didn't see this data (OSD read/write error counts) exposed in the Pacific version of Ceph's Prometheus-style exporter, but I may have overlooked it. That would be nice to have as well, if it doesn't already exist.
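
In the meantime, one rough workaround is to alert on whatever related signals the mgr prometheus module does expose (for example, inconsistent-PG counts) rather than per-OSD read/write error counters. Below is a minimal sketch against the standard Prometheus HTTP query API; the Prometheus URL and the ceph_pg_inconsistent metric name are assumptions on my part, so check what your exporter version actually emits.

#!/usr/bin/env python3
# Hypothetical sketch: poll Prometheus for Ceph inconsistency signals and
# complain if any pool reports inconsistent PGs. The Prometheus URL and the
# metric name (ceph_pg_inconsistent) are assumptions; verify against the
# metrics your mgr prometheus module actually exports.
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.example.com:9090"  # assumed endpoint
QUERY = 'sum by (pool_id) (ceph_pg_inconsistent) > 0'  # assumed metric name

def query_prometheus(expr):
    url = f"{PROMETHEUS}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    for sample in query_prometheus(QUERY):
        pool = sample["metric"].get("pool_id", "unknown")
        count = sample["value"][1]
        print(f"pool {pool}: {count} inconsistent PG(s) - investigate scrub errors")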

We collect drive counters at the host level and alert at thresholds well before there is general impact. Even before a failing drive starts returning errors to the host, it can cause frustrating latency spikes while it retries correctable errors - the OSD will not see these other than as longer latency on operations. Watching the SMART counters for either a high rate of change or values above thresholds you define is most certainly something I would suggest covering in whatever host-level monitoring you're already performing for production usage.
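
To make the SMART side concrete, here's a simplified sketch (not our production tooling) that pulls raw counters out of smartctl's JSON output and flags anything over thresholds you define. The attribute IDs, field names, and thresholds are illustrative assumptions and differ by drive model and interface.

#!/usr/bin/env python3
# Hypothetical sketch: read SMART counters via `smartctl -j -A <dev>` and flag
# drives whose raw counters exceed thresholds you define. Attribute IDs and the
# thresholds below are illustrative only and differ across drive models.
import json
import subprocess
import sys

# SATA attribute IDs to watch (assumed typical values).
WATCHED_ATTRS = {5: "Reallocated_Sector_Ct", 187: "Reported_Uncorrect", 197: "Current_Pending_Sector"}
THRESHOLDS = {5: 0, 187: 0, 197: 0}  # alert on any non-zero raw value

def check_device(dev):
    data = json.loads(subprocess.run(
        ["smartctl", "-j", "-A", dev], capture_output=True, text=True
    ).stdout)
    alerts = []
    # ATA drives expose a table of attributes; NVMe exposes a health log instead.
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        attr_id = attr.get("id")
        if attr_id in WATCHED_ATTRS and attr["raw"]["value"] > THRESHOLDS[attr_id]:
            alerts.append(f"{WATCHED_ATTRS[attr_id]}={attr['raw']['value']}")
    media_errors = data.get("nvme_smart_health_information_log", {}).get("media_errors", 0)
    if media_errors:
        alerts.append(f"media_errors={media_errors}")
    return alerts

if __name__ == "__main__":
    for dev in sys.argv[1:]:
        for alert in check_device(dev):
            print(f"{dev}: {alert}")

Run something like it against your data devices from cron or your agent of choice, and feed the output into whatever alerting you already have.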

David

On Mon, Jan 9, 2023, at 17:46, Erik Lindahl wrote:
> Hi,
> 
> Good points; however, given that Ceph already collects all these statistics, isn't there any way to set reasonable thresholds and actually have Ceph detect the number of read errors and suggest that a given drive should be replaced?
> 
> It seems a bit strange that we should all have to wait for a PG read error, then log into each node to check the number of read errors for each device and keep track of it ourselves. Of course it's possible to write scripts for everything, but there must be numerous Ceph sites with hundreds of OSD nodes, so I'm a bit surprised this isn't more automated...
> 
> Cheers,
> 
> Erik
> 
> --
> Erik Lindahl <erik.lindahl@xxxxxxxxx>
> On 10 Jan 2023 at 00:09 +0100, Anthony D'Atri <aad@xxxxxxxxxxxxxx>, wrote:
> >
> >
> > > On Jan 9, 2023, at 17:46, David Orman <ormandj@xxxxxxxxxxxx> wrote:
> > >
> > > It's important to note we do not suggest using the SMART "OK" indicator as proof that the drive is healthy. We monitor correctable/uncorrectable error counts, as you can see a dramatic rise when a drive starts to fail. SMART health will report 'OK' long after the drive is throwing many uncorrectable errors and needs replacement. You have to look at the actual counters themselves.
> >
> > I strongly agree, especially given personal experience with SSD firmware design flaws.
> >
> > Also, examining UDMA / CRC error rates led to the discovery that certain aftermarket drive carriers had lower tolerances than those from the chassis vendor, resulting in drives that were silently slow. Reseating in most cases restored performance.
> >
> > — aad
> >
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



