Hi,

> On 10 Jan 2023, at 07:10, David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
> We ship all of this to our centralized monitoring system (and a lot
> more) and have dashboards/proactive monitoring/alerting with 100PiB+
> of Ceph. If you're running Ceph in production, I believe host-level
> monitoring is critical, above and beyond the Ceph level. Things like
> inlet/outlet temperature, hardware state of various components, and
> various other details are probably best served by monitoring external
> to Ceph itself.

I agree with David's suggestions.

> I did a quick glance and didn't see this data (OSD errors re:
> reads/writes) exposed in the Pacific version of Ceph's Prometheus-style
> exporter, but I may have overlooked it. This would be nice to have as
> well, if it does not exist.
>
> We collect drive counters at the host level and alert at levels prior
> to general impact. Even a failing drive can cause frustrating latency
> spikes before it starts returning (correctable) errors - the OSD will
> not see these other than as longer latency on operations. Seeing a
> change in the SMART counters, either at a high rate or above thresholds
> you define, is most certainly something I would suggest ensuring is
> covered in whatever host-level monitoring you're already performing
> for production usage.

It seems to me that there is no need to reinvent the wheel and create
even more GIL problems for ceph-mgr. Last year a production-ready
exporter for smartctl data, with NVMe support, was released [1]. It's
written in Go, has CI, and is tested in production with Ceph - ready
to go 🙂

[1] https://github.com/prometheus-community/smartctl_exporter
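
As a rough sketch of the kind of Prometheus alerting rules you could put
on top of [1] - the metric and label names below
(smartctl_device_smart_status, smartctl_device_attribute,
Reallocated_Sector_Ct) are my assumption about the exporter's output and
should be checked against the version you deploy; NVMe devices expose a
different set of counters:

groups:
  - name: smart-health
    rules:
      - alert: SmartStatusFailed
        # Overall SMART self-assessment is no longer "passed".
        expr: smartctl_device_smart_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SMART status failed on {{ $labels.instance }} ({{ $labels.device }})"

      - alert: SmartReallocatedSectorsGrowing
        # Raw Reallocated_Sector_Ct grew over the last hour - the
        # "rate of change" style of alert David describes, which tends
        # to fire well before the OSD sees actual read/write errors.
        expr: delta(smartctl_device_attribute{attribute_name="Reallocated_Sector_Ct",attribute_value_type="raw"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Reallocated sectors growing on {{ $labels.instance }} ({{ $labels.device }})"

The threshold-style equivalent is the same expression without delta(),
compared against whatever absolute value you consider actionable for
your drive models.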