I'm testing octopus 15.2.16 and ran into a problem right away. I'm filling up a small test cluster with 3 hosts (3x3 OSDs) and killed one OSD to see how recovery works. I have one 4+2 EC pool with failure domain host, and on 1 PG of this pool 2 (!!!) shards are missing. This most degraded PG is not becoming active; it's stuck inactive but peered.

Questions:

- How can 2 OSDs be missing if only 1 OSD is down?
- Wasn't there an important code change to allow recovery for an EC PG with at least k shards present, even if min_size>k? Do I have to set something?
- If the PG should recover, why is it not prioritised, considering its severe degradation compared with all other PGs?

I have already increased these crush tunables and executed a pg repeer, to no avail (a crushtool check is in the P.S. below):

tunable choose_total_tries 250    <-- default 100

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 50    <-- default 5
        step set_choose_tries 200    <-- default 100
        step take default
        step choose indep 0 type osd
        step emit
}

Ceph health detail says about this:

[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
    pg 4.32 is stuck inactive for 37m, current state recovery_wait+undersized+degraded+remapped+peered, last acting [1,2147483647,2147483647,4,5,2]

I don't want to cheat and set min_size=k on this pool. It should work by itself.

Thanks for any pointers!

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
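
P.S. In case anyone wants to poke at the same state, this is roughly how I've been inspecting the PG. The 2147483647 entries in the acting set are 2^31-1, i.e. CRUSH_ITEM_NONE, so CRUSH simply returned no OSD for those two shard positions:

    # full peering/recovery state of the stuck PG, including
    # the up/acting sets and the per-shard peering info
    ceph pg 4.32 query

    # quick view of just the up/acting mapping
    ceph pg map 4.32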
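
Regarding the min_size question: as far as I can tell, the knob that was added for this is osd_allow_recovery_below_min_size, but I'm not certain that's the exact change I remember, so treat this as a guess:

    # whether recovery below min_size is allowed at all
    # (assumption on my part that this is the relevant option)
    ceph config get osd osd_allow_recovery_below_min_size

    # what min_size the pool actually has; I'm assuming here
    # that the pool carries the same name as the rule, fs-data
    ceph osd pool get fs-data min_size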
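
And to rule out the CRUSH rule itself, the mapping can be replayed offline with crushtool, simulating the dead OSD by setting its weight to 0 (osd.3 below is just a stand-in for whichever OSD was killed):

    # dump the live crushmap and test rule 1 (fs-data) for 6 shards
    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 1 --num-rep 6 \
        --weight 3 0 --show-mappings --show-bad-mappings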