Re: Bug in crush algorithm? 1 PG with the same OSD twice.

Hi Frank,

I suspect this is a combination of issues.
1. You have "choose" instead of "chooseleaf" in rule 1.
2. osd.7 is destroyed but still "up" in the osdmap.
3. The _tries settings in rule 1 are not helping.

Here are my tests:

# osdmaptool --test-map-pg 4.1c osdmap.bin
osdmaptool: osdmap file 'osdmap.bin'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) acting ([6,1,4,5,3,1], p6)

^^ This is what you observe now (2147483647 is CRUSH_ITEM_NONE, i.e. the "NONE" slot in the UP set).

# diff -u crush.txt crush.txt2
--- crush.txt 2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2 2022-08-30 11:31:29.631491424 +0200
@@ -93,7 +93,7 @@
  step set_chooseleaf_tries 50
  step set_choose_tries 200
  step take default
- step choose indep 0 type osd
+ step chooseleaf indep 0 type osd
  step emit
 }
# crushtool -c crush.txt2 -o crush.map2
# cp osdmap.bin osdmap.bin2
# osdmaptool --import-crush crush.map2 osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
osdmaptool: imported 1166 byte crush map from crush.map2
osdmaptool: writing epoch 4990 to osdmap.bin2
# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting ([6,1,4,5,3,1], p6)

^^ The mapping is now "correct" in that it no longer maps osd.1 twice. However, it tries to use osd.7, which is destroyed but still up.

You might be able to fix that by fully marking osd.7 out.
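For example, something along these lines should do it (a sketch, not tested against your cluster; only purge once you are sure you won't revive osd.7 under the same id):

# ceph osd out 7
# ceph osd purge 7 --yes-i-really-mean-it

The "out" stops CRUSH from trying to place data on osd.7; the purge removes it from the osdmap and crush map entirely.
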
I can also get a good mapping by removing the *_tries settings from rule 1:

# diff -u crush.txt crush.txt2
--- crush.txt 2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2 2022-08-30 11:38:14.068102835 +0200
@@ -90,10 +90,8 @@
  type erasure
  min_size 3
  max_size 6
- step set_chooseleaf_tries 50
- step set_choose_tries 200
  step take default
- step choose indep 0 type osd
+ step chooseleaf indep 0 type osd
  step emit
 }
...
# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)

Note that I didn't need to adjust the reweights:

# osdmaptool osdmap.bin2 --tree
osdmaptool: osdmap file 'osdmap.bin2'
ID CLASS WEIGHT  TYPE NAME         STATUS    REWEIGHT PRI-AFF
-1       2.44798 root default
-7       0.81599     host tceph-01
 0   hdd 0.27199         osd.0            up  0.87999 1.00000
 3   hdd 0.27199         osd.3            up  0.98000 1.00000
 6   hdd 0.27199         osd.6            up  0.92999 1.00000
-3       0.81599     host tceph-02
 2   hdd 0.27199         osd.2            up  0.95999 1.00000
 4   hdd 0.27199         osd.4            up  0.89999 1.00000
 8   hdd 0.27199         osd.8            up  0.89999 1.00000
-5       0.81599     host tceph-03
 1   hdd 0.27199         osd.1            up  0.89999 1.00000
 5   hdd 0.27199         osd.5            up  1.00000 1.00000
 7   hdd 0.27199         osd.7     destroyed        0 1.00000
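
Before injecting anything for real, it might be worth checking the rule against all possible inputs with crushtool; something like this (rule id 1 and 6 shards assumed from your rule and pool) should print nothing if no bad mappings are left:

# crushtool -i crush.map2 --test --rule 1 --num-rep 6 --show-bad-mappings

Once it looks sane, the edited map could be injected with "ceph osd setcrushmap -i crush.map2".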


Does this work in real life?

Cheers, Dan


On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> please find attached (only 7K, so I hope it goes through). md5sum=1504652f1b95802a9f2fe3725bf1336e
>
> I was playing around a bit with the crush map and found the following:
>
> 1) Setting all re-weights to 1 does produce valid mappings. However, it will lead to large imbalances and is impractical in operations.
>
> 2) Doing something as simple/stupid as the following also results in valid mappings without having to change the weights:
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200
>         step take default
>         step chooseleaf indep 3 type host
>         step emit
>         step take default
>         step chooseleaf indep -3 type host
>         step emit
> }
>
> rule fs-data {
>         id 1
>         type erasure
>         min_size 3
>         max_size 6
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200
>         step take default
>         step choose indep 3 type osd
>         step emit
>         step take default
>         step choose indep -3 type osd
>         step emit
> }
>
> Of course, now the current weights are probably unsuitable, as everything moves around. It probably also takes a lot more total tries to get rid of mappings with duplicate OSDs.
>
> I probably have to read the code to understand how drawing straws from 8 different buckets with non-zero probabilities can lead to an infinite sequence of failed attempts at getting 6 different ones. There seems to be a hard-coded tunable that somehow turns the seemingly infinite into the finite.
>
> The first modified rule will probably lead to a better distribution of load, but a bad distribution of data if a disk goes down (considering the tiny host and disk counts). The second rule seems to be about as good or bad as the default one (step choose indep 0 type osd), except that it does produce valid mappings where the default rule fails.
>
> I will wait with changing the rule in the hope that you find a more elegant solution to this riddle.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: 29 August 2022 19:13
> To: Frank Schilder
> Subject: Re:  Bug in crush algorithm? 1 PG with the same OSD twice.
>
> Hi Frank,
>
> Could you share the osdmap so I can try to solve this riddle?
>
> Cheers , Dan
>
>
> On Mon, Aug 29, 2022, 17:26 Frank Schilder <frans@xxxxxx> wrote:
> Hi Dan,
>
> thanks for your answer. I'm not really convinced that we hit a corner case here, and even if it's one, it seems quite relevant for production clusters. The usual way to get a valid mapping is to increase the number of tries. I increased the following maximum numbers of tries, which I would expect to produce a mapping for all PGs:
>
> # diff map-now.txt map-new.txt
> 4c4
> < tunable choose_total_tries 50
> ---
> > tunable choose_total_tries 250
> 93,94c93,94
> <       step set_chooseleaf_tries 5
> <       step set_choose_tries 100
> ---
> >       step set_chooseleaf_tries 50
> >       step set_choose_tries 200
>
> When I test the map with crushtool it does not report bad mappings. Am I looking at the wrong tunables to increase? It should be possible to get valid mappings without having to modify the re-weights.
>
> Thanks again for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: 29 August 2022 16:52:52
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Bug in crush algorithm? 1 PG with the same OSD twice.
>
> Hi Frank,
>
> CRUSH can only find 5 OSDs, given your current tree, rule, and
> reweights. This is why there is a NONE in the UP set for shard 6.
> But in ACTING we see that it is refusing to remove shard 6 from osd.1
> -- that is the only copy of that shard, so in this case it's helping
> you rather than deleting the shard altogether.
> ACTING == what the OSDs are serving now.
> UP == where CRUSH wants to place the shards.
>
> I suspect that this is a case of CRUSH tunables + your reweights
> putting CRUSH in a corner case of not finding 6 OSDs for that
> particular PG.
> If you set the reweights all back to 1, it probably finds 6 OSDs?
>
> Cheers, Dan
>
>
> On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi all,
> >
> > I'm investigating a problem with a degenerate PG on an Octopus 15.2.16 test cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with failure domain OSD. After simulating a disk failure by removing an OSD and letting the cluster recover (all under load), I end up with a PG with the same OSD allocated twice:
> >
> > PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
> >
> > OSD 1 is allocated twice. How is this even possible?
> >
> > Here the OSD tree:
> >
> > ID  CLASS  WEIGHT   TYPE NAME          STATUS     REWEIGHT  PRI-AFF
> > -1         2.44798  root default
> > -7         0.81599      host tceph-01
> >  0    hdd  0.27199          osd.0             up   0.87999  1.00000
> >  3    hdd  0.27199          osd.3             up   0.98000  1.00000
> >  6    hdd  0.27199          osd.6             up   0.92999  1.00000
> > -3         0.81599      host tceph-02
> >  2    hdd  0.27199          osd.2             up   0.95999  1.00000
> >  4    hdd  0.27199          osd.4             up   0.89999  1.00000
> >  8    hdd  0.27199          osd.8             up   0.89999  1.00000
> > -5         0.81599      host tceph-03
> >  1    hdd  0.27199          osd.1             up   0.89999  1.00000
> >  5    hdd  0.27199          osd.5             up   1.00000  1.00000
> >  7    hdd  0.27199          osd.7      destroyed         0  1.00000
> >
> > I already tried to change some tunables with https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon in mind, but giving up too soon is obviously not the problem. It is accepting a wrong mapping.
> >
> > Is there a way out of this? Clearly this is calling for trouble, if not data loss, and should not happen at all.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



