Re: Bug in crush algorithm? 1 PG with the same OSD twice.

Hi Dan,

thanks a lot for looking into this. I can't entirely reproduce your results. Maybe we are using different versions and there was a change? I'm testing with the octopus 15.2.16 image: quay.io/ceph/ceph:v15.2.16.
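
In case a version difference really is the explanation, the quickest check I know of is the following (output omitted here):

# ceph versions
# ceph osd versions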

Note: "step choose" was selected by ceph itself when it created the crush rule on pool creation. If the default should be "step chooseleaf" (even with type osd), then the automatic crush rule generation for EC profiles ought to be fixed.
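
For reference, this is roughly how such a rule comes into existence; the profile name and PG counts below are placeholders, but with crush-failure-domain=osd the auto-generated rule is the "step choose indep 0 type osd" variant quoted below:

# ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=osd
# ceph osd pool create fs-data 32 32 erasure ec42
# ceph osd crush rule dump fs-data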

Running the same experiments as you did, I can partly confirm your results, but I also see oddness that I would consider a bug (reported at the very end):

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step take default
        step choose indep 0 type osd
        step emit
}

# osdmaptool --test-map-pg 4.1c osdmap.bin
osdmaptool: osdmap file 'osdmap.bin'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6) acting ([6,1,4,5,3,1], p6)

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step take default
        step chooseleaf indep 0 type osd
        step emit
}

# osdmaptool --test-map-pg 4.1c osdmap.bin
osdmaptool: osdmap file 'osdmap.bin'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
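
For completeness, the loop I use to swap rules in and out for these tests is roughly the following; the file names are arbitrary and osdmap.bin.orig is just my pristine copy of the original map:

# cp osdmap.bin.orig osdmap.bin
# osdmaptool osdmap.bin --export-crush crush.map
# crushtool -d crush.map -o crush.txt
#   ... hand-edit the rule in crush.txt ...
# crushtool -c crush.txt -o crush.map.new
# osdmaptool --import-crush crush.map.new osdmap.bin
# osdmaptool --test-map-pg 4.1c osdmap.bin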

So far, so good. Now the oddness:

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default
        step chooseleaf indep 0 type osd
        step emit
}

# osdmaptool --test-map-pg 4.1c osdmap.bin
osdmaptool: osdmap file 'osdmap.bin'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)

How can this be different? I thought crush stops at the first successful mapping, so this ought to be identical to the previous one. It gets even weirder:

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 50
        step set_choose_tries 200
        step take default
        step chooseleaf indep 0 type osd
        step emit
}

# osdmaptool --test-map-pg 4.1c osdmap.bin
osdmaptool: osdmap file 'osdmap.bin'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting ([6,1,4,5,3,1], p6)

Whaaaaat???? We increase the maximum number of trials for searching and we end up with an invalid mapping??

These experiments indicate some very strange behaviour; I would actually call it a serious bug. The resulting mapping should be independent of the maximum number of tries (if I understood the crush algorithm correctly). In any case, a valid mapping should never be replaced in favour of an invalid one (one containing a down+out OSD).
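
As a side note for a tracker item: as I mentioned in an earlier mail, crushtool does not report bad mappings when I test the map. The check I use is along these lines (rule id 1 and 6 shards, matching the rule above):

# crushtool -i crush.map --test --rule 1 --num-rep 6 --show-bad-mappings

Note that this tests the crush map in isolation. The reweight values live in the osdmap, and I'm not sure crushtool takes them into account at all, which might be why it doesn't reproduce the problem.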

For now there is a happy ending on my test cluster:

# ceph pg dump pgs_brief | grep 4.1c
dumped pgs_brief
4.1c     active+remapped+backfilling  [6,1,4,5,3,8]           6  [6,1,4,5,3,1]               6
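
The mapping of a single PG can also be checked directly instead of grepping the full dump:

# ceph pg map 4.1c
# ceph pg 4.1c query        # the recovery_state section shows the backfill progress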

Please look into the extremely odd behaviour reported above. I'm quite confident that this is unintended, if not dangerous, behaviour and should be corrected. I'm willing to file a tracker item with the data above. I'm also wondering whether this might be related to https://tracker.ceph.com/issues/56995.

Thanks for tracking this down and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 30 August 2022 12:16:37
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re:  Bug in crush algorithm? 1 PG with the same OSD twice.

BTW, the defaults for _tries seem to work too:


# diff -u crush.txt crush.txt2
--- crush.txt 2022-08-30 11:27:41.941836374 +0200
+++ crush.txt2 2022-08-30 11:55:45.601891010 +0200
@@ -90,10 +90,10 @@
  type erasure
  min_size 3
  max_size 6
- step set_chooseleaf_tries 50
- step set_choose_tries 200
+ step set_chooseleaf_tries 5
+ step set_choose_tries 100
  step take default
- step choose indep 0 type osd
+ step chooseleaf indep 0 type osd
  step emit
 }

# osdmaptool --test-map-pg 4.1c osdmap.bin2
osdmaptool: osdmap file 'osdmap.bin2'
 parsed '4.1c' -> 4.1c
4.1c raw ([6,1,4,5,3,8], p6) up ([6,1,4,5,3,8], p6) acting ([6,1,4,5,3,1], p6)


-- dan

On Tue, Aug 30, 2022 at 11:50 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>
> BTW, I vaguely recalled seeing this before. Yup, found it:
> https://tracker.ceph.com/issues/55169
>
> On Tue, Aug 30, 2022 at 11:46 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> >
> > > 2. osd.7 is destroyed but still "up" in the osdmap.
> >
> > Oops, you can ignore this point -- this was an observation I had while
> > playing with the osdmap -- your osdmap.bin has osd.7 down correctly.
> >
> > In case you're curious, here was what confused me:
> >
> > # osdmaptool osdmap.bin2  --mark-up-in --mark-out 7 --dump plain
> > osd.7 up   out weight 0 up_from 3846 up_thru 3853 down_at 3855
> > last_clean_interval [0,0)
> > [v2:10.41.24.15:6810/1915819,v1:10.41.24.15:6811/1915819]
> > [v2:192.168.0.15:6808/1915819,v1:192.168.0.15:6809/1915819]
> > destroyed,exists,up
> >
> > Just ignore this ...
> >
> >
> >
> > -- dan
> >
> > On Tue, Aug 30, 2022 at 11:41 AM Dan van der Ster <dvanders@xxxxxxxxx> wrote:
> > >
> > > Hi Frank,
> > >
> > > I suspect this is a combination of issues.
> > > 1. You have "choose" instead of "chooseleaf" in rule 1.
> > > 2. osd.7 is destroyed but still "up" in the osdmap.
> > > 3. The _tries settings in rule 1 are not helping.
> > >
> > > Here are my tests:
> > >
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin
> > > osdmaptool: osdmap file 'osdmap.bin'
> > >  parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,2147483647], p6) up ([6,1,4,5,3,2147483647], p6)
> > > acting ([6,1,4,5,3,1], p6)
> > >
> > > ^^ This is what you observe now.
> > >
> > > # diff -u crush.txt crush.txt2
> > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > > +++ crush.txt2 2022-08-30 11:31:29.631491424 +0200
> > > @@ -93,7 +93,7 @@
> > >   step set_chooseleaf_tries 50
> > >   step set_choose_tries 200
> > >   step take default
> > > - step choose indep 0 type osd
> > > + step chooseleaf indep 0 type osd
> > >   step emit
> > >  }
> > > # crushtool -c crush.txt2 -o crush.map2
> > > # cp osdmap.bin osdmap.bin2
> > > # osdmaptool --import-crush crush.map2 osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > osdmaptool: imported 1166 byte crush map from crush.map2
> > > osdmaptool: writing epoch 4990 to osdmap.bin2
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > >  parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,7], p6) up ([6,1,4,5,3,2147483647], p6) acting
> > > ([6,1,4,5,3,1], p6)
> > >
> > > ^^ The mapping is now "correct" in that it doesn't duplicate the
> > > mapping to osd.1. However it tries to use osd.7 which is destroyed but
> > > up.
> > >
> > > You might be able to fix that by fully marking osd.7 out.
> > > I can also get a good mapping by removing the *_tries settings from rule 1:
> > >
> > > # diff -u crush.txt crush.txt2
> > > --- crush.txt 2022-08-30 11:27:41.941836374 +0200
> > > +++ crush.txt2 2022-08-30 11:38:14.068102835 +0200
> > > @@ -90,10 +90,8 @@
> > >   type erasure
> > >   min_size 3
> > >   max_size 6
> > > - step set_chooseleaf_tries 50
> > > - step set_choose_tries 200
> > >   step take default
> > > - step choose indep 0 type osd
> > > + step chooseleaf indep 0 type osd
> > >   step emit
> > >  }
> > > ...
> > > # osdmaptool --test-map-pg 4.1c osdmap.bin2
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > >  parsed '4.1c' -> 4.1c
> > > 4.1c raw ([6,1,4,5,3,2], p6) up ([6,1,4,5,3,2], p6) acting ([6,1,4,5,3,1], p6)
> > >
> > > Note that I didn't need to adjust the reweights:
> > >
> > > # osdmaptool osdmap.bin2 --tree
> > > osdmaptool: osdmap file 'osdmap.bin2'
> > > ID CLASS WEIGHT  TYPE NAME         STATUS    REWEIGHT PRI-AFF
> > > -1       2.44798 root default
> > > -7       0.81599     host tceph-01
> > >  0   hdd 0.27199         osd.0            up  0.87999 1.00000
> > >  3   hdd 0.27199         osd.3            up  0.98000 1.00000
> > >  6   hdd 0.27199         osd.6            up  0.92999 1.00000
> > > -3       0.81599     host tceph-02
> > >  2   hdd 0.27199         osd.2            up  0.95999 1.00000
> > >  4   hdd 0.27199         osd.4            up  0.89999 1.00000
> > >  8   hdd 0.27199         osd.8            up  0.89999 1.00000
> > > -5       0.81599     host tceph-03
> > >  1   hdd 0.27199         osd.1            up  0.89999 1.00000
> > >  5   hdd 0.27199         osd.5            up  1.00000 1.00000
> > >  7   hdd 0.27199         osd.7     destroyed        0 1.00000
> > >
> > >
> > > Does this work in real life?
> > >
> > > Cheers, Dan
> > >
> > >
> > > On Mon, Aug 29, 2022 at 7:38 PM Frank Schilder <frans@xxxxxx> wrote:
> > > >
> > > > Hi Dan,
> > > >
> > > > please find attached (only 7K, so I hope it goes through). md5sum=1504652f1b95802a9f2fe3725bf1336e
> > > >
> > > > I was playing a bit around with the crush map and found out the following:
> > > >
> > > > 1) Setting all re-weights to 1 does produce valid mappings. However, it will lead to large imbalances and is impractical in operations.
> > > >
> > > > 2) Doing something as simple/stupid as the following also results in valid mappings without having to change the weights:
> > > >
> > > > rule fs-data {
> > > >         id 1
> > > >         type erasure
> > > >         min_size 3
> > > >         max_size 6
> > > >         step set_chooseleaf_tries 50
> > > >         step set_choose_tries 200
> > > >         step take default
> > > >         step chooseleaf indep 3 type host
> > > >         step emit
> > > >         step take default
> > > >         step chooseleaf indep -3 type host
> > > >         step emit
> > > > }
> > > >
> > > > rule fs-data {
> > > >         id 1
> > > >         type erasure
> > > >         min_size 3
> > > >         max_size 6
> > > >         step set_chooseleaf_tries 50
> > > >         step set_choose_tries 200
> > > >         step take default
> > > >         step choose indep 3 type osd
> > > >         step emit
> > > >         step take default
> > > >         step choose indep -3 type osd
> > > >         step emit
> > > > }
> > > >
> > > > Of course, now the current weights are probably unsuitable as everything moves around. It probably also takes a lot more total tries to get rid of mappings with duplicate OSDs.
> > > >
> > > > I probably have to read the code to understand how drawing straws from 8 different buckets with non-zero probabilities can lead to an infinite sequence of failed attempts of getting 6 different ones. There seems to be a hard-coded tunable that turns seemingly infinite into finite somehow.
> > > >
> > > > The first modified rule will probably lead to better distribution of load, but bad distribution of data if a disk goes down (considering the tiny host- and disk numbers). The second rule seems to be almost as good or bad as the default one (step choose indep 0 type osd), except that it does produce valid mappings where the default rule fails.
> > > >
> > > > I will wait with changing the rule in the hope that you find a more elegant solution to this riddle.
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > >
> > > > ________________________________________
> > > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > > Sent: 29 August 2022 19:13
> > > > To: Frank Schilder
> > > > Subject: Re:  Bug in crush algorithm? 1 PG with the same OSD twice.
> > > >
> > > > Hi Frank,
> > > >
> > > > Could you share the osdmap so I can try to solve this riddle?
> > > >
> > > > Cheers , Dan
> > > >
> > > >
> > > > On Mon, Aug 29, 2022, 17:26 Frank Schilder <frans@xxxxxx> wrote:
> > > > Hi Dan,
> > > >
> > > > thanks for your answer. I'm not really convinced that we hit a corner case here, and even if it is one, it seems quite relevant for production clusters. The usual way to get a valid mapping is to increase the number of tries. I increased the following max trial numbers, which I would expect to produce a mapping for all PGs:
> > > >
> > > > # diff map-now.txt map-new.txt
> > > > 4c4
> > > > < tunable choose_total_tries 50
> > > > ---
> > > > > tunable choose_total_tries 250
> > > > 93,94c93,94
> > > > <       step set_chooseleaf_tries 5
> > > > <       step set_choose_tries 100
> > > > ---
> > > > >       step set_chooseleaf_tries 50
> > > > >       step set_choose_tries 200
> > > >
> > > > When I test the map with crushtool it does not report bad mappings. Am I looking at the wrong tunables to increase? It should be possible to get valid mappings without having to modify the re-weights.
> > > >
> > > > Thanks again for your help!
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > >
> > > > ________________________________________
> > > > From: Dan van der Ster <dvanders@xxxxxxxxx>
> > > > Sent: 29 August 2022 16:52:52
> > > > To: Frank Schilder
> > > > Cc: ceph-users@xxxxxxx
> > > > Subject: Re:  Bug in crush algorithm? 1 PG with the same OSD twice.
> > > >
> > > > Hi Frank,
> > > >
> > > > CRUSH can only find 5 OSDs, given your current tree, rule, and
> > > > reweights. This is why there is a NONE in the UP set for shard 6.
> > > > But in ACTING we see that it is refusing to remove shard 6 from osd.1
> > > > -- that is the only copy of that shard, so in this case it's helping
> > > > you rather than deleting the shard altogether.
> > > > ACTING == what the OSDs are serving now.
> > > > UP == where CRUSH wants to place the shards.
> > > >
> > > > I suspect that this is a case of CRUSH tunables + your reweights
> > > > putting CRUSH in a corner case of not finding 6 OSDs for that
> > > > particular PG.
> > > > If you set the reweights all back to 1, it probably finds 6 OSDs?
> > > >
> > > > Cheers, Dan
> > > >
> > > >
> > > > On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder <frans@xxxxxx> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I'm investigating a problem with a degenerated PG on an octopus 15.2.16 test cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with failure domain OSD. After simulating a disk failure by removing an OSD and letting the cluster recover (all under load), I end up with a PG with the same OSD allocated twice:
> > > > >
> > > > > PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
> > > > >
> > > > > OSD 1 is allocated twice. How is this even possible?
> > > > >
> > > > > Here the OSD tree:
> > > > >
> > > > > ID  CLASS  WEIGHT   TYPE NAME          STATUS     REWEIGHT  PRI-AFF
> > > > > -1         2.44798  root default
> > > > > -7         0.81599      host tceph-01
> > > > >  0    hdd  0.27199          osd.0             up   0.87999  1.00000
> > > > >  3    hdd  0.27199          osd.3             up   0.98000  1.00000
> > > > >  6    hdd  0.27199          osd.6             up   0.92999  1.00000
> > > > > -3         0.81599      host tceph-02
> > > > >  2    hdd  0.27199          osd.2             up   0.95999  1.00000
> > > > >  4    hdd  0.27199          osd.4             up   0.89999  1.00000
> > > > >  8    hdd  0.27199          osd.8             up   0.89999  1.00000
> > > > > -5         0.81599      host tceph-03
> > > > >  1    hdd  0.27199          osd.1             up   0.89999  1.00000
> > > > >  5    hdd  0.27199          osd.5             up   1.00000  1.00000
> > > > >  7    hdd  0.27199          osd.7      destroyed         0  1.00000
> > > > >
> > > > > I tried already to change some tunables thinking about https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon, but giving up too soon is obviously not the problem. It is accepting a wrong mapping.
> > > > >
> > > > > Is there a way out of this? Clearly this is calling for trouble if not data loss and should not happen at all.
> > > > >
> > > > > Best regards,
> > > > > =================
> > > > > Frank Schilder
> > > > > AIT Risø Campus
> > > > > Bygning 109, rum S14
> > > > > _______________________________________________
> > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



