Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

We stopped deep scrubbing a while ago. However, forcing a deep scrub with "ceph pg deep-scrub 6.180" doesn't do anything; the deep scrub doesn't run at all. Could the deep scrub process be stuck elsewhere?

On 11/18/21 3:29 PM, Wesley Dillingham wrote:
That response is typically indicative of a PG whose OSD set has changed since it was last scrubbed (typically because a disk failed).

Are you sure it's actually getting scrubbed when you issue the scrub? For example, you can issue "ceph pg <pg_id> query" and look for "last_deep_scrub_stamp", which will tell you when it was last deep scrubbed.
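
Something along these lines should show the stamps directly (a rough sketch, assuming jq is available; the JSON paths are from memory, so adjust them if your version lays the query output out differently):

    # when the PG was last scrubbed / deep scrubbed
    ceph pg 6.180 query | jq '.info.stats.last_scrub_stamp, .info.stats.last_deep_scrub_stamp'

If the deep-scrub stamp doesn't advance after you issue the scrub, it never actually ran.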

Further, in sufficiently recent versions of Ceph (introduced in 14.2.something, iirc), setting the "nodeep-scrub" flag will cause all in-flight deep scrubs to stop immediately. You may have a scheduling issue where your deep scrubs or repairs aren't getting scheduled.
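
A few quick checks for that sort of scheduling problem (just a sketch; the option names should be the same on recent releases, but verify against your version):

    # cluster-wide flags -- noscrub / nodeep-scrub here would explain it
    ceph osd dump | grep flags
    # how many concurrent scrubs each OSD will accept (default is 1)
    ceph config get osd osd_max_scrubs
    # any scrub-related health warnings
    ceph health detail | grep -i scrub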

Set the nodeep-scrub flag with "ceph osd set nodeep-scrub" and wait for all current deep scrubs to complete, then manually re-issue the deep scrub with "ceph pg deep-scrub <pg_id>". At this point your scrub should start almost immediately, and "rados list-inconsistent-obj 6.180 --format=json-pretty" should return something of value.
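
Roughly this sequence, with your pg_id substituted in (and don't forget to unset the flag afterwards):

    ceph osd set nodeep-scrub
    # wait until "ceph -s" no longer shows any PGs deep scrubbing, then:
    ceph pg deep-scrub 6.180
    rados list-inconsistent-obj 6.180 --format=json-pretty
    ceph osd unset nodeep-scrub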

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Thu, Nov 18, 2021 at 2:38 PM J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx> wrote:

    Hi,

    We currently have a PG stuck in an inconsistent state on an erasure
    coded pool. The pool's K and M values are 33 and 3. The command
    "rados list-inconsistent-obj 6.180 --format=json-pretty" results in
    the following error:

    No scrub information available for pg 6.180 error 2: (2) No such
    file or directory

    Forcing a deep scrub of the PG does not fix this. Doing a "ceph pg
    repair 6.180" doesn't seem to do anything. Is there a known bug
    explaining this behavior? I am attaching information regarding the
    PG in question.

    --
    Jean-Philippe Méthot
    Senior Openstack system administrator
    Administrateur système Openstack sénior
    PlanetHoster inc.


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



