Dear all,

We're running Ceph Luminous and recently hit an issue with some OSDs (OSDs being marked out automatically, I/O and CPU overload) which unfortunately left one placement group in the state "stale+active+clean". It is a placement group from the .rgw.root pool:

  1.15 0 0 0 0 0 0 1 1 stale+active+clean 2020-05-11 23:22:51.396288 40'1 2142:152 [3,2,6] 3 [3,2,6] 3 40'1 2020-04-22 00:46:05.904418 40'1 2020-04-20 20:18:13.371396 0

I guess there is no active replica of that pg anywhere in the cluster. Restarting the osd.3, osd.2 or osd.6 daemons does not help.

I've used ceph-objectstore-tool to successfully export the placement group from osd.3, osd.2 and osd.6 and tried to import it on a completely different OSD. The exports differ slightly in file size; the one from osd.3, which was the latest primary, is the biggest, so I tried to import that one on a different OSD. When that OSD starts up I see the following (this is from osd.1):

  2020-05-14 21:43:19.779740 7f7880ac3700 1 osd.1 pg_epoch: 2459 pg[1.15( v 40'1 (0'0,40'1] local-lis/les=2073/2074 n=0 ec=73/39 lis/c 2073/2073 les/c/f 2074/2074/633 2145/39/2145) [] r=-1 lpr=2455 crt=40'1 lcod 0'0 unknown NOTIFY] state<Start>: transitioning to Stray

From previous pg dumps (taken several weeks earlier, while the pg was still active+clean) I see it held 115 bytes and zero objects, but I am not sure how to interpret that.

As this is a pg from the .rgw.root pool, I cannot get any response from the cluster when accessing it (everything times out).

What is the correct course of action with this pg? Any help would be greatly appreciated.

Thanks,
Tomislav
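
P.S. For reference, the export/import was driven roughly like this (a sketch only; the data paths assume the default OSD layout and the export file name is illustrative, and each OSD daemon was stopped while ceph-objectstore-tool ran against it):

  # export pg 1.15 from a stopped OSD (repeated for osd.3, osd.2 and osd.6)
  systemctl stop ceph-osd@3
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --pgid 1.15 --op export --file /root/pg1.15.osd3.export
  systemctl start ceph-osd@3

  # import the export taken from osd.3 into a different, stopped OSD (osd.1)
  systemctl stop ceph-osd@1
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --op import --file /root/pg1.15.osd3.export
  systemctl start ceph-osd@1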