Hi Jesper,

thanks for looking at this. The failure domain is OSD and not host. I typed it wrong in the text; the copy of the crush rule shows it right: step choose indep 0 type osd.

I'm trying to reproduce the observation to file a tracker item, but it is more difficult than expected. It might be a race condition; so far I haven't seen it again. I hope I can figure out when and why this is happening.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jesper Lykkegaard Karlsen <jelka@xxxxxxxxx>
Sent: 28 July 2022 12:02:51
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: PG does not become active

Hi Frank,

I think you need at least 6 OSD hosts to make EC 4+2 with failure domain host. I do not know how it was possible for you to create that configuration in the first place.

Could it be that you have multiple names for the OSD hosts? That would at least explain the one OSD down being shown as two OSDs down.

Also, I believe that min_size should never be smaller than the number of data shards (k), which is 4 in this case.

You can either make a new test setup with your three test OSD hosts using EC 2+1, or use e.g. 4+2 but with the failure domain set to OSD.

Best,
Jesper

--------------------------
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: jelka@xxxxxxxxx
Tlf: +45 50906203

> On 27 Jul 2022, at 17.32, Frank Schilder <frans@xxxxxx> wrote:
>
> Update: the inactive PG got recovered and active after a loooonngg wait. The middle question is now answered. However, these two questions are still of great worry:
>
> - How can 2 OSDs be missing if only 1 OSD is down?
> - If the PG should recover, why is it not prioritised considering its severe degradation
> compared with all other PGs?
>
> I don't understand how a PG can lose 2 shards if 1 OSD goes down.
> That looks really, really bad to me (did Ceph lose track of data??).
>
> The second is of no less importance. The inactive PG was holding back client IO, leading to further warnings about slow OPS/requests/... Why are such critically degraded PGs not scheduled for recovery first? There is a service outage but only a health warning?
>
> Thanks and best regards.
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: 27 July 2022 17:19:05
> To: ceph-users@xxxxxxx
> Subject: PG does not become active
>
> I'm testing Octopus 15.2.16 and ran into a problem right away. I'm filling up a small test cluster with 3 hosts (3x3 OSDs) and killed one OSD to see how recovery works. I have one 4+2 EC pool with failure domain host, and on 1 PG of this pool 2 (!!!) shards are missing. This most degraded PG is not becoming active; it's stuck inactive but peered.
>
> Questions:
>
> - How can 2 OSDs be missing if only 1 OSD is down?
> - Wasn't there an important code change to allow recovery for an EC PG with at
> least k shards present even if min_size>k? Do I have to set something?
> - If the PG should recover, why is it not prioritised considering its severe degradation
> compared with all other PGs?
>
> I have already increased these crush tunables and executed a pg repeer to no avail:
>
> tunable choose_total_tries 250 <-- default 100
> rule fs-data {
>     id 1
>     type erasure
>     min_size 3
>     max_size 6
>     step set_chooseleaf_tries 50 <-- default 5
>     step set_choose_tries 200 <-- default 100
>     step take default
>     step choose indep 0 type osd
>     step emit
> }
>
> ceph health detail reports:
>
> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
> pg 4.32 is stuck inactive for 37m, current state recovery_wait+undersized+degraded+remapped+peered, last acting [1,2147483647,2147483647,4,5,2]
>
> I don't want to cheat and set min_size=k on this pool.
> It should work by itself.
>
> Thanks for any pointers!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
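A minimal Python sketch of the shard arithmetic behind the `ceph health detail` output in this thread. In the acting set, 2147483647 (0x7fffffff) is CRUSH_ITEM_NONE, i.e. no OSD is mapped to that shard position, so two of the six 4+2 shards have no home. The sketch assumes the pool's min_size is k+1 = 5; the thread does not state the pool's actual min_size (the `min_size 3` in the crush rule is a separate, rule-level setting). The helper `shard_status` is hypothetical: it only counts shards and does not query Ceph.

```python
# 0x7fffffff: CRUSH's "no OSD mapped" placeholder in up/acting sets.
CRUSH_ITEM_NONE = 2147483647


def shard_status(acting, k, m, min_size):
    """Count present/missing EC shards in an acting set (hypothetical helper)."""
    if len(acting) != k + m:
        raise ValueError(f"expected {k + m} shard slots, got {len(acting)}")
    present = sum(1 for osd in acting if osd != CRUSH_ITEM_NONE)
    return {
        "present": present,
        "missing": len(acting) - present,
        # At least k shards are needed to reconstruct the data.
        "reconstructable": present >= k,
        # Simplified: the PG serves client IO only with >= min_size shards.
        "active": present >= min_size,
    }


# Acting set from the health output, with the assumed min_size = k + 1 = 5.
status = shard_status([1, 2147483647, 2147483647, 4, 5, 2], k=4, m=2, min_size=5)
print(status)
# {'present': 4, 'missing': 2, 'reconstructable': True, 'active': False}
```

Under that assumption, 4 of 6 shards are present: enough to rebuild the data (>= k) but below min_size, which is consistent with the PG being `peered` but not active until recovery restored a fifth shard.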