Hi Dan,

please find the osdmap attached (only 7K, so I hope it goes through).
md5sum=1504652f1b95802a9f2fe3725bf1336e

I was playing around a bit with the crush map and found the following:

1) Setting all re-weights to 1 does produce valid mappings. However, it will lead to large imbalances and is impractical in operations.

2) Doing something as simple/stupid as the following also results in valid mappings without having to change the weights:

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 50
        step set_choose_tries 200
        step take default
        step chooseleaf indep 3 type host
        step emit
        step take default
        step chooseleaf indep -3 type host
        step emit
}

and, as a second variant:

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 50
        step set_choose_tries 200
        step take default
        step choose indep 3 type osd
        step emit
        step take default
        step choose indep -3 type osd
        step emit
}

Of course, now the current weights are probably unsuitable, as everything moves around. It probably also takes a lot more total tries to get rid of mappings with duplicate OSDs.

I probably have to read the code to understand how drawing straws from 8 different buckets with non-zero probabilities can lead to an infinite sequence of failed attempts at getting 6 different ones. There seems to be a hard-coded tunable that somehow turns seemingly infinite into finite.

The first modified rule will probably lead to better distribution of load, but bad distribution of data if a disk goes down (considering the tiny host and disk counts). The second rule seems to be almost as good or bad as the default one (step choose indep 0 type osd), except that it does produce valid mappings where the default rule fails.

I will hold off on changing the rule in the hope that you find a more elegant solution to this riddle.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 29 August 2022 19:13
To: Frank Schilder
Subject: Re: Bug in crush algorithm? 1 PG with the same OSD twice.

Hi Frank,

Could you share the osdmap so I can try to solve this riddle?

Cheers, Dan

On Mon, Aug 29, 2022, 17:26 Frank Schilder <frans@xxxxxx> wrote:

Hi Dan,

thanks for your answer. I'm not really convinced that we hit a corner case here, and even if it is one, it seems quite relevant for production clusters.

The usual way to get a valid mapping is to increase the number of tries. I increased the following max trial numbers, which I would expect to produce a mapping for all PGs:

# diff map-now.txt map-new.txt
4c4
< tunable choose_total_tries 50
---
> tunable choose_total_tries 250
93,94c93,94
<         step set_chooseleaf_tries 5
<         step set_choose_tries 100
---
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200

When I test the map with crushtool it does not report bad mappings. Am I looking at the wrong tunables to increase? It should be possible to get valid mappings without having to modify the re-weights.

Thanks again for your help!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 29 August 2022 16:52:52
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Bug in crush algorithm? 1 PG with the same OSD twice.

Hi Frank,

CRUSH can only find 5 OSDs, given your current tree, rule, and reweights. This is why there is a NONE in the UP set for shard 6.
But in ACTING we see that it is refusing to remove shard 6 from osd.1 -- that is the only copy of that shard, so in this case it's helping you rather than deleting the shard altogether.

ACTING == what the OSDs are serving now.
UP == where CRUSH wants to place the shards.

I suspect that this is a case of CRUSH tunables + your reweights putting CRUSH in a corner case of not finding 6 OSDs for that particular PG. If you set the reweights all back to 1, it probably finds 6 OSDs?

Cheers, Dan

On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> I'm investigating a problem with a degenerated PG on an octopus 15.2.16 test cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with failure domain OSD. After simulating a disk failure by removing an OSD and letting the cluster recover (all under load), I end up with a PG with the same OSD allocated twice:
>
> PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
>
> OSD 1 is allocated twice. How is this even possible?
>
> Here is the OSD tree:
>
> ID  CLASS  WEIGHT   TYPE NAME          STATUS     REWEIGHT  PRI-AFF
> -1         2.44798  root default
> -7         0.81599      host tceph-01
>  0    hdd  0.27199          osd.0             up   0.87999  1.00000
>  3    hdd  0.27199          osd.3             up   0.98000  1.00000
>  6    hdd  0.27199          osd.6             up   0.92999  1.00000
> -3         0.81599      host tceph-02
>  2    hdd  0.27199          osd.2             up   0.95999  1.00000
>  4    hdd  0.27199          osd.4             up   0.89999  1.00000
>  8    hdd  0.27199          osd.8             up   0.89999  1.00000
> -5         0.81599      host tceph-03
>  1    hdd  0.27199          osd.1             up   0.89999  1.00000
>  5    hdd  0.27199          osd.5             up   1.00000  1.00000
>  7    hdd  0.27199          osd.7      destroyed         0  1.00000
>
> I already tried to change some tunables, thinking about https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon, but giving up too soon is obviously not the problem. It is accepting a wrong mapping.
>
> Is there a way out of this? Clearly this is asking for trouble, if not data loss, and should not happen at all.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
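
For reference, the crushtool check mentioned in this thread can be reproduced roughly as follows. This is a minimal sketch, assuming the rule id 1 and the 6-shard (4+2 EC) pool discussed above; the file names are placeholders and not taken from the original messages.

# extract and decompile the current crush map (file names are examples)
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt (tunables and/or the fs-data rule), then recompile
crushtool -c crushmap.txt -o crushmap-new.bin

# test the edited map offline: rule id 1, 6 shards per PG,
# and report PGs for which CRUSH does not return a complete mapping
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 6 --show-bad-mappings

# only if the test output looks sane, inject the new map into the cluster
ceph osd setcrushmap -i crushmap-new.bin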