Weird pg degradation behavior

Hi,

I lost a node due to a CPU failure, so until the hardware is fixed I'm leaving it down and Ceph has marked its OSDs out. However, I keep it in the CRUSH tree because it will be online again in 1-2 days.

This is the current state:
    health: HEALTH_WARN
            Degraded data redundancy: 1726/23917349718 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
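
(For reference, that summary and the per-pg detail behind it come from the stock CLI, nothing special:)

    # cluster summary, as quoted above
    ceph -s
    # lists the affected pg(s) and their acting sets
    ceph health detail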

Out of the 9 OSD nodes, 2 temporarily have 4 NVMe OSDs each and 7 have only 2 NVMe OSDs each, for the index pool (3 replicas) where this pg is located.
(The down node is one of the 2x NVMe ones.)
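
(A quick way to see that layout and which OSDs are currently down/out, just the standard tree views:)

    # CRUSH tree with per-OSD status; out OSDs show REWEIGHT 0, down ones show "down" in STATUS
    ceph osd tree
    # same tree with utilization per OSD
    ceph osd df tree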

This is part of the query output for that specific degraded pg:
...
 "up": [
     233,
     202
 ],
 "acting": [
     233,
     202
 ],
 "avail_no_missing": [
     "233",
     "202"
 ],
...

The above-mentioned OSDs are on the two nodes with 4 NVMe OSDs.

What is not clear to me: I don't see any probing OSD or any other indicator of the missing 3rd copy of that specific pg.
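
(This is where I would expect such an indicator to show up if it existed, assuming the usual pg query JSON layout; jq is only there for readability:)

    # peering/recovery details; fields like probing_osds or might_have_unfound would appear here
    ceph pg 10.f6 query | jq '.recovery_state'
    # peers the primary currently knows about for this pg
    ceph pg 10.f6 query | jq '[.peer_info[].peer]'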

This is the pgs_brief:
PG_STAT  STATE                       UP         UP_PRIMARY  ACTING     ACTING_PRIMARY
10.f6    active+undersized+degraded  [233,202]  233         [233,202]  233
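
(That row comes from the brief pg dump, filtered to this one pg; the awk filter is just for convenience:)

    # dump all pgs in brief form and keep only the row for 10.f6
    ceph pg dump pgs_brief 2>/dev/null | awk '$1 == "10.f6"'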

Shouldn't there be some indicator of what it is looking for?
Which OSD, or some probing OSD? (It's Quincy 17.2.7.)

If I purge the down OSDs on the down server, I guess it would kick off recovery; I'm just curious why it doesn't show anywhere what it is looking for.
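
(For the record, this is roughly what I would run for that; the OSD IDs are placeholders for the two down OSDs on that host:)

    # remove a down OSD from the cluster (and from CRUSH), which lets the pg remap and recover
    ceph osd purge <osd-id> --yes-i-really-mean-it
    # repeat for the second down OSD on that host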

Ty
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


