I managed to reproduce the problem. I filed a tracker item: https://tracker.ceph.com/issues/56995

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 28 July 2022 12:45:41
To: Jesper Lykkegaard Karlsen
Cc: ceph-users@xxxxxxx
Subject: Re: PG does not become active

Hi Jesper, thanks for looking at this. The failure domain is OSD and not host. I typed it wrong in the text; the copy of the crush rule shows it right: step choose indep 0 type osd.

I'm trying to reproduce the observation to file a tracker item, but it is more difficult than expected. It might be a race condition; so far I haven't seen it again. I hope I can figure out when and why this is happening.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jesper Lykkegaard Karlsen <jelka@xxxxxxxxx>
Sent: 28 July 2022 12:02:51
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: PG does not become active

Hi Frank,

I think you need at least 6 OSD hosts to make EC 4+2 with failure domain host. I do not know how it was possible for you to create that configuration in the first place. Could it be that you have multiple names for the OSD hosts? That would at least explain one OSD down being shown as two OSDs down.

Also, I believe that min_size should never be smaller than the number of data (k) shards, which is 4 in this case.

You can either make a new test setup with your three test OSD hosts using EC 2+1, or make e.g. 4+2 but with failure domain set to OSD.

Best,
Jesper

--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: jelka@xxxxxxxxx
Tlf: +45 50906203
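A minimal sketch of Jesper's second suggestion, for reference: an EC 4+2 pool whose CRUSH failure domain is the OSD rather than the host. The profile and pool names (ec42-osd, ec21-host, test-ec) and the PG count are placeholders, not taken from the thread.

    # EC profile with k=4 data + m=2 coding shards, spread over OSDs instead of hosts
    ceph osd erasure-code-profile set ec42-osd k=4 m=2 crush-failure-domain=osd
    ceph osd erasure-code-profile get ec42-osd

    # Pool using that profile; 32 PGs is an arbitrary choice for a small test cluster
    ceph osd pool create test-ec 32 32 erasure ec42-osd

    # Alternative for a 3-host test setup keeping failure domain host, as suggested:
    # ceph osd erasure-code-profile set ec21-host k=2 m=1 crush-failure-domain=host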
> On 27 Jul 2022, at 17.32, Frank Schilder <frans@xxxxxx> wrote:
>
> Update: the inactive PG got recovered and active after a loooonngg wait. The middle question is now answered. However, these two questions are still of great concern:
>
> - How can 2 OSDs be missing if only 1 OSD is down?
> - If the PG should recover, why is it not prioritised considering its severe degradation compared with all other PGs?
>
> I don't understand how a PG can lose 2 shards if 1 OSD goes down. That looks really, really bad to me (did ceph lose track of data??).
>
> The second question is of no less importance. The inactive PG was holding back client IO, leading to further warnings about slow OPS/requests/... Why are such critically degraded PGs not scheduled for recovery first? There is a service outage, but only a health warning?
>
> Thanks and best regards.
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 27 July 2022 17:19:05
> To: ceph-users@xxxxxxx
> Subject: PG does not become active
>
> I'm testing octopus 15.2.16 and ran into a problem right away. I'm filling up a small test cluster with 3 hosts and 3 OSDs each, and killed one OSD to see how recovery works. I have one 4+2 EC pool with failure domain host, and on 1 PG of this pool 2 (!!!) shards are missing. This most degraded PG is not becoming active; it's stuck inactive but peered.
>
> Questions:
>
> - How can 2 OSDs be missing if only 1 OSD is down?
> - Wasn't there an important code change to allow recovery for an EC PG with at least k shards present even if min_size>k? Do I have to set something?
> - If the PG should recover, why is it not prioritised considering its severe degradation compared with all other PGs?
>
> I have already increased these crush tunables and executed a pg repeer, to no avail:
>
> tunable choose_total_tries 250    <-- default 100
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 50    <-- default 5
>         step set_choose_tries 200      <-- default 100
>         step take default
>         step choose indep 0 type osd
>         step emit
> }
>
> Ceph health detail says about this:
>
> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>     pg 4.32 is stuck inactive for 37m, current state recovery_wait+undersized+degraded+remapped+peered, last acting [1,2147483647,2147483647,4,5,2]
>
> I don't want to cheat and set min_size=k on this pool. It should work by itself.
>
> Thanks for any pointers!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
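For anyone debugging a similarly stuck EC PG, a few diagnostic commands along the lines of what is discussed above. pg 4.32 is the PG from Frank's report; <pool> stands for the affected pool, which the thread does not name (only the crush rule fs-data); availability and output of some commands vary between releases.

    # List inactive PGs and inspect the peering/recovery state of the affected one
    ceph pg dump_stuck inactive
    ceph pg 4.32 query

    # Check the pool's EC profile and min_size (Frank deliberately keeps min_size above k)
    ceph osd pool ls detail
    ceph osd pool get <pool> min_size

    # Ask for this PG to be recovered ahead of less degraded ones, and re-run peering
    ceph pg force-recovery 4.32
    ceph pg repeer 4.32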