If you're using an Icinga active check that just looks for

    SMART overall-health self-assessment test result: PASSED

then it's not doing much for you. That binary status can be shown for
a drive that is decidedly an ex-parrot. Gotta look at specific
attributes, which is thorny since they aren't consistently implemented
across vendors. drivedb.h is a downright mess, which doesn't help.

> ----- On 12 Apr 24, at 15:17, Albert Shih Albert.Shih@xxxxxxxx wrote:
>
>> On 12/04/2024 at 12:56:12+0200, Frédéric Nass wrote:
>>
>> Hi,
>>
>>> Have you checked the hardware status of the involved drives other than with
>>> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for DELL
>>> hardware for example).
>>
>> Yes, all my disks are «under» periodic check with smartctl + icinga.
>
> Actually, I meant lower level tools (drive / server vendor tools).
>
>>> If these tools don't report any media error (that is, bad blocks on disks) then
>>> you might just be facing the bit rot phenomenon. But this is very rare and
>>> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
>>> professional poker player's lifetime. ;-)
>>>
>>> If no media error is reported, then you might want to check and update the
>>> firmware of all drives.
>>
>> You're perfectly right.
>>
>> It's just a newbie error: I checked the «main» OSD of the PG (meaning the
>> first in the list) but forgot to check the others.
>
> Ok.
>
>> On one server I do indeed get some errors on a disk.
>>
>> But strangely smartctl reports nothing. I will add a check with dmesg.
>
> That's why I pointed you to the drive / server vendor tools earlier, as
> sometimes smartctl is missing the information you want.
>
>>> Once you've figured it out, you may enable osd_scrub_auto_repair=true to have these
>>> inconsistencies repaired automatically on deep-scrubbing, but make sure you're
>>> using the alert module [1] so as to at least get informed about the scrub errors.
>>
>> Thanks. I will look into it, because we already have icinga2 on site, so I use
>> icinga2 to check the cluster.
>>
>> Is there a list of what the alert module is going to check?
>
> Basically the module checks for ceph status (ceph -s) changes.
>
> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py
>
> Regards,
> Frédéric.
>
>> Regards
>>
>> JAS
>> --
>> Albert SHIH 🦫 🐸
>> France
>> Local time:
>> Fri 12 April 2024 15:13:13 CEST
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
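
P.S. To make the point at the top of this message about specific attributes a bit more concrete, here is a rough sketch of a per-attribute check, not a finished plugin. It assumes smartmontools 7.0 or later (for `smartctl --json`) and ATA drives. The list of watched attribute IDs, the "raw value above zero" rule, and all function names are my own assumptions for illustration; vendors differ and some pack extra fields into the raw value, so adjust to your fleet.

```python
import json
import subprocess

# Attribute IDs often treated as early-failure signals on ATA drives.
# This selection is an assumption for illustration; vendors differ.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def read_smart_json(device):
    """Return smartctl's JSON report for one device (needs smartmontools >= 7.0)."""
    # smartctl's exit status is a bitmask, so a non-zero code is not fatal here.
    proc = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)


def suspicious_attributes(report, watched=WATCHED):
    """List (id, name, raw_value) for watched attributes whose raw value is non-zero."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    hits = []
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in watched and raw > 0:
            hits.append((attr["id"], attr.get("name", watched[attr["id"]]), raw))
    return hits


def icinga_status(report):
    """Map a report to (exit_code, message) using the 0/1/2 plugin convention."""
    passed = report.get("smart_status", {}).get("passed")
    hits = suspicious_attributes(report)
    if passed is False:
        return 2, "CRITICAL: overall health FAILED"
    if hits:
        detail = ", ".join(f"{name}={raw}" for _, name, raw in hits)
        return 1, f"WARNING: {detail}"
    return 0, "OK: no watched attribute is non-zero"
```

A wrapper would call `icinga_status(read_smart_json("/dev/sda"))`, print the message and exit with the code. A drive that still says PASSED but has pending sectors then shows up as WARNING instead of OK.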