Re: ceph osd purge => 4 PGs stale+undersized+degraded+peered

Hello,

My 4 PGs are now active!
David Casier from aevoo.fr has succeeded in mounting the OSD I had recently purged.
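
I don't have David's exact procedure to share, but as far as I understand the general idea (assuming the data directory of the purged OSD is still intact once the disk is mounted again) is to export the missing PG from the old OSD and import it into one of the OSDs now acting for it, with ceph-objectstore-tool. A rough sketch, using the IDs from my case (pg 1.561, old OSD 64, new primary OSD 70) as examples:

# on the host where the purged OSD's data is mounted (OSD not running)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-64 \
    --pgid 1.561 --op export --file /tmp/pg1.561.export

# on the host holding one of the PG's current OSDs (stop that OSD first)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-70 \
    --pgid 1.561 --op import --file /tmp/pg1.561.export

# restart the target OSD and let peering/recovery take over
systemctl start ceph-osd@70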

The problem seems to be a bug in the number of retries (the firstn values) in the CRUSH rule.
In fact, I thought my PGs were replicated across the 3 rooms, but that was not the case.
I run Ceph 15.2.15.

Here is the CRUSH rule which did not replicate the PGs over the 3 rooms (bug??):
rule 3replicats3sites_rule {
    id 2
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 3 type room
    step chooseleaf firstn 4 type host
    step emit
}

And here is the corrected rule, which now distributes the PGs over the 3 rooms as intended. As far as I understand, the old "firstn 3 / firstn 4" form emits up to 12 candidate OSDs grouped by room, so a size-3 pool can end up with all three replicas on hosts in the first room; "firstn 0 / firstn 1" picks the pool-size number of rooms and exactly one host in each:
rule 3replicats3sites_rule {
    id 2
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 0 type room
    step chooseleaf firstn 1 type host
    step emit
}
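
For anyone who wants to check a rule change like this before injecting it, the placement can be simulated offline with crushtool. A quick sketch (rule id 2 and 3 replicas come from the rule above; the file names are just examples):

# grab and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit the rule in crushmap.txt, then recompile
crushtool -c crushmap.txt -o crushmap-new.bin

# simulate placements for rule 2 with 3 replicas;
# --show-bad-mappings reports any input that gets fewer than 3 OSDs
crushtool -i crushmap-new.bin --test --rule 2 --num-rep 3 --show-mappings
crushtool -i crushmap-new.bin --test --rule 2 --num-rep 3 --show-bad-mappings

# inject only once the simulated mappings look right
ceph osd setcrushmap -i crushmap-new.bin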

Thank you for your answers.

Rafael


On 17/01/2022 at 15:24, Rafael Diaz Maurin wrote:
Hello,

All my pools on the cluster are replicated (x3).

I purged some OSDs (after I had stopped them) and removed the disks from the servers, and now I have 4 PGs in stale+undersized+degraded+peered.

Reduced data availability: 4 pgs inactive, 4 pgs stale

pg 1.561 is stuck stale for 39m, current state stale+undersized+degraded+peered, last acting [64]
pg 1.af2 is stuck stale for 39m, current state stale+undersized+degraded+peered, last acting [63]
pg 3.3 is stuck stale for 39m, current state stale+undersized+degraded+peered, last acting [48]
pg 9.5ca is stuck stale for 38m, current state stale+undersized+degraded+peered, last acting [49]


Those 4 OSDs have been purged, so they aren't in the CRUSH map anymore.

I tried a pg repair:
ceph pg repair 1.561
ceph pg repair 1.af2
ceph pg repair 3.3
ceph pg repair 9.5ca


The PGs are remapped, but none of the degraded objects have been repaired:

ceph pg map 9.5ca
osdmap e355782 pg 9.5ca (9.5ca) -> up [54,75,82] acting [54,75,82]
ceph pg map 3.3
osdmap e355782 pg 3.3 (3.3) -> up [179,180,107] acting [179,180,107]
ceph pg map 1.561
osdmap e355785 pg 1.561 (1.561) -> up [70,188,87] acting [70,188,87]
ceph pg map 1.af2
osdmap e355789 pg 1.af2 (1.af2) -> up [189,74,184] acting [189,74,184]
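
In case more detail is useful, the stuck PGs can also be listed and inspected directly, for example (using 9.5ca from above):

# list every PG currently stuck in the stale state
ceph pg dump_stuck stale

# ask the PG's primary for its full peering state and history
ceph pg 9.5ca query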

How can I repair my 4 PGs?

This affects the cephfs-metadata pool, and the filesystem is degraded because the rank 0 MDS is stuck in the rejoin state.


Thank you.

Rafael





--
Rafael Diaz Maurin
DSI de l'Université de Rennes 1
Pôle Infrastructures, équipe Systèmes
02 23 23 71 57

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
