Hi,

> On 10 Jan 2023, at 07:10, David Orman <ormandj@xxxxxxxxxxxx> wrote:
>
> We ship all of this to our centralized monitoring system (and a lot
> more) and have dashboards/proactive monitoring/alerting with 100PiB+
> of Ceph. If you're running Ceph in production, I believe host-level
> monitoring is critical, above and beyond the Ceph level. Things like
> inlet/outlet temperature, hardware state of various components, and
> various other details are probably best served by monitoring external
> to Ceph itself.

I agree with David's suggestions.

> I did a quick glance and didn't see this data (OSD errors re:
> reads/writes) exposed in the Pacific version of Ceph's Prometheus-style
> exporter, but I may have overlooked it. This would be nice to have as
> well, if it does not exist.
>
> We collect drive counters at the host level and alert at levels prior
> to general impact. Even a failing drive can cause frustrating latency
> spikes before it starts returning (correctable) errors - the OSD will
> not see these other than as longer latency on operations. Seeing a
> change in the SMART counters, either at a high rate or above thresholds
> you define, is most certainly something I would suggest ensuring is
> covered in whatever host-level monitoring you're already performing
> for production usage.

It seems to me that there is no need to reinvent the wheel and create
even more GIL problems for ceph-mgr. Last year a production-ready
exporter for smartctl data, with NVMe support, was released [1]. It's
written in Go, has CI, and is tested in production with Ceph - ready
to go 🙂

[1] https://github.com/prometheus-community/smartctl_exporter
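
As a rough sketch of the kind of Prometheus alerting rules you could put
on top of [1] - the metric and label names below
(smartctl_device_smart_status, smartctl_device_attribute,
Reallocated_Sector_Ct) are my assumption about the exporter's output and
should be checked against the version you deploy; NVMe devices expose a
different set of counters:

groups:
  - name: smart-health
    rules:
      - alert: SmartStatusFailed
        # Overall SMART self-assessment is no longer "passed".
        expr: smartctl_device_smart_status == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SMART status failed on {{ $labels.instance }} ({{ $labels.device }})"

      - alert: SmartReallocatedSectorsGrowing
        # Raw Reallocated_Sector_Ct grew over the last hour - the
        # "rate of change" style of alert David describes, which tends
        # to fire well before the OSD sees actual read/write errors.
        expr: delta(smartctl_device_attribute{attribute_name="Reallocated_Sector_Ct",attribute_value_type="raw"}[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Reallocated sectors growing on {{ $labels.instance }} ({{ $labels.device }})"

The threshold-style equivalent is the same expression without delta(),
compared against whatever absolute value you consider actionable for
your drive models.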