I see a few (a priori) potential issues with this:

- Given "disks" is THE key scaling dimension in a Ceph cluster, depending on
how many metrics per device this exporter generates, it could negatively
impact Prometheus performance (we already experienced such an issue when we
explored adding cAdvisor support... and discarded that). A quick way to gauge
the per-device series count is sketched at the end of this message.
- Depending on the type of smartctl testing performed, it might interfere
with cluster IO load (after checking the actual metrics exported
<https://github.com/prometheus-community/smartctl_exporter/blob/master/metrics.go>,
that doesn't seem to be the case).
- Ceph already exposes SMART-based health checks, metrics and alerts from the
devicehealth/diskprediction modules
<https://docs.ceph.com/en/latest/rados/operations/devices/#enabling-monitoring>;
the relevant CLI commands are also sketched at the end of this message. I
find this kind of high-level monitoring more digestible for operators than
low-level SMART metrics.

Kind Regards,
Ernesto

On Fri, Oct 14, 2022 at 9:31 PM Fox, Kevin M <Kevin.Fox@xxxxxxxx> wrote:

> Would it cause problems to mix the smartctl exporter along with Ceph's
> built-in monitoring stuff?
>
> Thanks,
> Kevin
>
> ________________________________________
> From: Wyll Ingersoll <wyllys.ingersoll@xxxxxxxxxxxxxx>
> Sent: Friday, October 14, 2022 10:48 AM
> To: Konstantin Shalygin; John Petrini
> Cc: Marc; Paul Mezzanini; ceph-users
> Subject: Re: monitoring drives
>
> This looks very useful. Has anyone created a Grafana dashboard that will
> display the collected data?
>
> ________________________________
> From: Konstantin Shalygin <k0ste@xxxxxxxx>
> Sent: Friday, October 14, 2022 12:12 PM
> To: John Petrini <jpetrini@xxxxxxxxxxxx>
> Cc: Marc <Marc@xxxxxxxxxxxxxxxxx>; Paul Mezzanini <pfmeec@xxxxxxx>;
> ceph-users <ceph-users@xxxxxxx>
> Subject: Re: monitoring drives
>
> Hi,
>
> You can get these metrics, even wear level, from the official
> smartctl_exporter [1]
>
> [1] https://github.com/prometheus-community/smartctl_exporter
>
> k
> Sent from my iPhone
>
> > On 14 Oct 2022, at 17:12, John Petrini <jpetrini@xxxxxxxxxxxx> wrote:
> >
> > We run a mix of Samsung and Intel SSDs; our solution was to write a
> > script that parses the output of the Samsung SSD Toolkit and Intel
> > ISDCT CLI tools, respectively. In our case, we expose those metrics
> > using node_exporter's textfile collector for ingestion by Prometheus.
> > It's mostly the same SMART data, but it helps identify some
> > vendor-specific SMART metrics, namely SSD wear level, that we were
> > unable to decipher from the raw SMART data.
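For anyone who wants to replicate the approach John describes, here is a
minimal sketch of a textfile-collector script. The vendor CLI ("vendor-tool"
below), its output format and the collector directory are assumptions, since
the actual script wasn't shared; adapt them to your tooling:

    #!/bin/sh
    # Hypothetical sketch: publish a vendor-reported SSD wear level through
    # node_exporter's textfile collector. Assumes node_exporter runs with
    # --collector.textfile.directory=/var/lib/node_exporter/textfile.
    OUT=/var/lib/node_exporter/textfile/ssd_wear.prom
    TMP="$OUT.$$"

    {
        echo '# HELP ssd_wear_level_percent Vendor-reported SSD wear level (0-100).'
        echo '# TYPE ssd_wear_level_percent gauge'
        for dev in /dev/sd?; do
            # "vendor-tool wear-level" stands in for the Samsung/Intel CLI and
            # is assumed to print a bare percentage for the given device.
            wear=$(vendor-tool wear-level "$dev") || continue
            echo "ssd_wear_level_percent{device=\"$dev\"} $wear"
        done
    } > "$TMP" && mv "$TMP" "$OUT"

The write-then-rename at the end keeps the update atomic, so node_exporter
never serves a half-written file.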
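The built-in device health monitoring mentioned above can be enabled and
inspected with the standard Ceph CLI (per the docs page linked in the first
message):

    ceph device monitoring on                    # let Ceph scrape SMART data itself
    ceph device ls                               # list devices and the daemons using them
    ceph device get-health-metrics <devid>       # dump stored health metrics for one device
    ceph device predict-life-expectancy <devid>  # diskprediction estimate, if enabled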
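On the Prometheus scaling concern: before rolling smartctl_exporter out
fleet-wide, you can bound the impact by scraping a single node by hand and
counting the series, then multiplying by your host count. The port below is
the exporter's default (9633); adjust if you run it elsewhere:

    # Series emitted by one node's exporter:
    curl -s http://localhost:9633/metrics | grep -c '^smartctl_'

    # Total series across the cluster once Prometheus is scraping it
    # (PromQL, run in the Prometheus UI):
    #   count({__name__=~"smartctl_.*"})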