Re: PG does not become active

Ah, I see, I should have looked at the “raw” data instead ;-)

Then I agree, this is very weird.

Best, 
Jesper

--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: jelka@xxxxxxxxx
Tlf:    +45 50906203

> On 28 Jul 2022, at 12.45, Frank Schilder <frans@xxxxxx> wrote:
> 
> Hi Jesper,
> 
> thanks for looking at this. The failure domain is OSD and not host. I typed it wrong in the text, the copy of the crush rule shows it right: step choose indep 0 type osd.
> 
> I'm trying to reproduce the observation to file a tracker item, but it is more difficult than expected. It might be a race condition, so far I didn't see it again. I hope I can figure out when and why this is happening.
> 
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Jesper Lykkegaard Karlsen <jelka@xxxxxxxxx>
> Sent: 28 July 2022 12:02:51
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  PG does not become active
> 
> Hi Frank,
> 
> I think you need at least 6 OSD hosts to run EC 4+2 with failure domain host.
> 
> I do not know how it was possible for you to create that configuration in the first place.
> Could it be that you have multiple names for the OSD hosts?
> That would at least explain one OSD down being shown as two OSDs down.
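> 
> A quick sanity check (just a suggestion, not tied to your setup) would be to list the CRUSH tree and verify that each physical machine shows up as exactly one host bucket:
> 
>     ceph osd tree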
> 
> Also, I believe that min_size should never be smaller than the number of data shards (k), which is 4 in this case.
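> 
> For example, to check and adjust it (the pool name below is just a placeholder; k+1 = 5 is the usual default for 4+2):
> 
>     ceph osd pool get <pool-name> min_size
>     ceph osd pool set <pool-name> min_size 5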
> 
> You can either make a new test setup with your three test OSD hosts using EC 2+1, or e.g. 4+2 but with the failure domain set to OSD (a rough sketch follows below).
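> 
> Roughly something like this for the 4+2 variant (profile and pool names are just examples):
> 
>     ceph osd erasure-code-profile set ec42-osd k=4 m=2 crush-failure-domain=osd
>     ceph osd pool create test-ec 32 32 erasure ec42-osd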
> 
> Best,
> Jesper
> 
> --------------------------
> Jesper Lykkegaard Karlsen
> Scientific Computing
> Centre for Structural Biology
> Department of Molecular Biology and Genetics
> Aarhus University
> Universitetsbyen 81
> 8000 Aarhus C
> 
> E-mail: jelka@xxxxxxxxx
> Tlf:    +45 50906203
> 
>> On 27 Jul 2022, at 17.32, Frank Schilder <frans@xxxxxx> wrote:
>> 
>> Update: the inactive PG recovered and became active after a very long wait. The middle question is now answered. However, these two questions are still of great concern:
>> 
>> - How can 2 OSDs be missing if only 1 OSD is down?
>> - If the PG should recover, why is it not prioritised considering its severe degradation
>> compared with all other PGs?
>> 
>> I don't understand how a PG can lose 2 shards if 1 OSD goes down. That looks really really bad to me (did ceph lose track of data??).
>> 
>> The second is of no less importance. The inactive PG was holding back client IO, leading to further warnings about slow OPS/requests/... Why are such critically degraded PGs not scheduled for recovery first? There is a service outage, yet only a health warning?
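>> 
>> As a manual workaround one can apparently bump the recovery priority of a specific PG, e.g. for the PG mentioned below:
>> 
>>     ceph pg force-recovery 4.32
>> 
>> but that of course does not answer why the prioritisation is not handled automatically.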
>> 
>> Thanks and best regards.
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: 27 July 2022 17:19:05
>> To: ceph-users@xxxxxxx
>> Subject:  PG does not become active
>> 
>> I'm testing octopus 15.2.16 and ran into a problem right away. I'm filling up a small test cluster with 3 hosts and 3x3 OSDs, and killed one OSD to see how recovery works. I have one 4+2 EC pool with failure domain host, and on 1 PG of this pool 2 (!!!) shards are missing. This most degraded PG is not becoming active, it's stuck inactive but peered.
>> 
>> Questions:
>> 
>> - How can 2 OSDs be missing if only 1 OSD is down?
>> - Wasn't there an important code change to allow recovery for an EC PG with at
>> least k shards present even if min_size>k? Do I have to set something?
>> - If the PG should recover, why is it not prioritised considering its severe degradation
>> compared with all other PGs?
>> 
>> I have already increased these crush tunables and executed a pg repeer to no avail:
>> 
>> tunable choose_total_tries 250 <-- default 100
>> rule fs-data {
>>       id 1
>>       type erasure
>>       min_size 3
>>       max_size 6
>>       step set_chooseleaf_tries 50 <-- default 5
>>       step set_choose_tries 200 <-- default 100
>>       step take default
>>       step choose indep 0 type osd
>>       step emit
>> }
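>> 
>> (For reference, tunables and rule steps like these can be changed with the usual decompile/recompile cycle, roughly:
>> 
>>     ceph osd getcrushmap -o crush.bin
>>     crushtool -d crush.bin -o crush.txt
>>     # edit choose_total_tries / set_choose_tries etc. in crush.txt
>>     crushtool -c crush.txt -o crush.new
>>     ceph osd setcrushmap -i crush.new
>> )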
>> 
>> Ceph health detail says to that:
>> 
>> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>>   pg 4.32 is stuck inactive for 37m, current state recovery_wait+undersized+degraded+remapped+peered, last acting [1,2147483647,2147483647,4,5,2]
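>> 
>> (Side note: the 2147483647 entries in the acting set are CRUSH's "none" placeholder, i.e. no OSD is currently mapped for those shards. The per-shard peering state can be inspected with:
>> 
>>     ceph pg 4.32 query
>> )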
>> 
>> I don't want to cheat and set min_size=k on this pool. It should work by itself.
>> 
>> Thanks for any pointers!
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
