Re: PG inconsistent

"Anthony D'Atri" <aad@xxxxxxxxxxxxxx> · Fri, 12 Apr 2024 10:35:35 -0400

If you're using an Icinga active check that just looks for 

SMART overall-health self-assessment test result: PASSED

then it's not doing much for you.  That binary pass/fail status can be shown for a drive that is decidedly an ex-parrot.  Gotta look at specific attributes, which is thorny since vendors don't implement them consistently.  smartmontools' drivedb.h is a downright mess, which doesn't help.
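
To make "look at specific attributes" concrete, here is a minimal Python sketch of a check that ignores the overall verdict and looks at a few raw counters instead. It assumes the report comes from `smartctl --json -A <device>` (smartmontools 7.0 or later) on an ATA drive, with the attribute table under `ata_smart_attributes.table`. The watched IDs and the "any non-zero raw value is suspicious" rule are illustrative choices of mine, not something that holds for every vendor, and SAS and NVMe drives report differently. The sample report is hand-written, not captured from a real drive.

```python
import json

# Attribute IDs often treated as media-failure indicators on ATA drives.
# Assumption for illustration only: vendors differ in what they expose.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def suspicious_attributes(report):
    """Return {name: raw_value} for watched attributes with a non-zero raw value."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    found = {}
    for attr in table:
        if attr.get("id") in WATCHED:
            raw = attr.get("raw", {}).get("value", 0)
            if raw > 0:
                found[WATCHED[attr["id"]]] = raw
    return found


# Hand-written sample: overall status says "passed", yet sectors are pending.
sample = json.loads("""
{
  "smart_status": {"passed": true},
  "ata_smart_attributes": {
    "table": [
      {"id": 5,   "name": "Reallocated_Sector_Ct",  "raw": {"value": 0}},
      {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}}
    ]
  }
}
""")

print(suspicious_attributes(sample))
```

An Icinga check built this way would go critical on the pending sectors even though the overall self-assessment still reads PASSED.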

> ----- On 12 Apr 24, at 15:17, Albert Shih Albert.Shih@xxxxxxxx wrote:
> 
>> On 12/04/2024 at 12:56:12+0200, Frédéric Nass wrote:
>>> 
>> Hi,
>> 
>>> 
>>> Have you checked the hardware status of the involved drives with something
>>> other than smartctl? For example with the manufacturer's tools / WebUI
>>> (iDRAC / perccli for Dell hardware).
>> 
>> Yes, all my disks are under periodic checks with smartctl + icinga.
> 
> Actually, I meant lower level tools (drive / server vendor tools).
> 
>> 
>>> If these tools don't report any media errors (that is, bad blocks on disks) then
>>> you might just be facing the bit rot phenomenon. But this is very rare and
>>> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
>>> professional poker player's lifetime. ;-)
>>> 
>>> If no media error is reported, then you might want to check and update the
>>> firmware of all drives.
>> 
>> You're perfectly right.
>> 
>> It's just a newbie error: I checked the "main" OSD of the PG (meaning the
>> first in the list) but forgot to check the others.
>> 
> 
> Ok.
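
For anyone else who trips over this: a small sketch that lists every OSD of a PG rather than just the first one. It assumes the input is the JSON from `ceph pg map <pgid> --format json` and that it carries `up` and `acting` lists; the PG id and OSD numbers below are invented for illustration.

```python
import json


def osds_to_check(pg_map):
    """Return every OSD id in the PG's acting and up sets, in listed order, no duplicates."""
    ordered = []
    for osd in pg_map.get("acting", []) + pg_map.get("up", []):
        if osd not in ordered:
            ordered.append(osd)
    return ordered


# Invented sample; a real one would come from the ceph CLI.
sample = json.loads('{"pgid": "11.2f", "up": [12, 4, 31], "acting": [12, 4, 31]}')

for osd in osds_to_check(sample):
    print(f"check the drive behind osd.{osd}")
```

The point is only that every replica's drive gets inspected, since the bad copy can sit on any of them.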
> 
>> On one server I do indeed get some errors on a disk.
>> 
>> But strangely smartctl reports nothing. I will add a check with dmesg.
> 
> That's why I pointed you to the drive / server vendor tools earlier as sometimes smartctl is missing the information you want.
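
If you do add a kernel-log check, a rough sketch of the filtering part could look like this. The patterns are my guesses at useful strings, not an exhaustive or authoritative list, and the sample lines are hand-written for illustration rather than copied from a real machine. In practice the text would come from `dmesg` or `journalctl -k`.

```python
import re

# Illustrative patterns only; tune them against what your kernel actually logs.
DISK_ERROR = re.compile(
    r"(I/O error|medium error|unrecovered read error)", re.IGNORECASE
)


def disk_error_lines(log_text):
    """Return the log lines that match any of the disk-error patterns."""
    return [line for line in log_text.splitlines() if DISK_ERROR.search(line)]


# Hypothetical, hand-written log excerpt.
sample = """\
[12345.678] sd 0:0:3:0: [sdd] Sense Key : Medium Error [current]
[12345.679] sd 0:0:3:0: [sdd] Add. Sense: Unrecovered read error
[12345.680] critical medium error, dev sdd, sector 123456
[12350.000] EXT4-fs (sda1): mounted filesystem
"""

for line in disk_error_lines(sample):
    print(line)
```

A check like this complements the vendor tools rather than replacing them; it only sees what the kernel happened to log since boot or since the ring buffer last wrapped.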
> 
>> 
>>> 
>>> Once you've figured it out, you may enable osd_scrub_auto_repair=true to have these
>>> inconsistencies repaired automatically on deep-scrubbing, but make sure you're
>>> using the alert module [1] so that you at least get informed about the scrub errors.
>> 
>> Thanks. I will look into it; we already have icinga2 on site, so I use
>> icinga2 to check the cluster.
>> 
>> Is there a list of what the alert module is going to check?
> 
> Basically the module checks for ceph status (ceph -s) changes.
> 
> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py
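
Since Icinga is already in place, the scrub-error condition can also be polled directly. Below is a minimal sketch, not the alerts module's own logic. It assumes the input is the JSON from `ceph health detail --format json`, with a `checks` map keyed by health code, and it treats the `OSD_SCRUB_ERRORS` and `PG_DAMAGED` codes as critical. The function name and the sample payload are made up; exit codes follow the usual Nagios/Icinga plugin convention (0 OK, 2 CRITICAL).

```python
import json

# Health codes treated as critical in this sketch.
CRITICAL_CHECKS = ("OSD_SCRUB_ERRORS", "PG_DAMAGED")


def scrub_check(health):
    """Return (exit_code, message) in the Nagios/Icinga plugin style."""
    checks = health.get("checks", {})
    hits = []
    for code in CRITICAL_CHECKS:
        if code in checks:
            summary = checks[code].get("summary", {}).get("message", "")
            hits.append(f"{code}: {summary}")
    if hits:
        return 2, "CRITICAL - " + "; ".join(hits)
    return 0, "OK - no scrub errors reported"


# Hand-written sample payload, for illustration only.
sample = json.loads("""
{
  "status": "HEALTH_ERR",
  "checks": {
    "OSD_SCRUB_ERRORS": {"severity": "HEALTH_ERR",
                         "summary": {"message": "1 scrub errors"}},
    "PG_DAMAGED": {"severity": "HEALTH_ERR",
                   "summary": {"message": "Possible data damage: 1 pg inconsistent"}}
  }
}
""")

code, message = scrub_check(sample)
print(code, message)
```

This matters most if osd_scrub_auto_repair is enabled, because the repair would otherwise happen silently and hide a drive that keeps producing errors.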
> 
> Regards,
> Frédéric.
> 
>> 
>> 
>> Regards
>> 
>> JAS
>> --
>> Albert SHIH 🦫 🐸
>> France
>> Local time:
>> Fri. 12 April 2024 15:13:13 CEST
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx