Hi Dan,

please find the osdmap attached (only 7K, so I hope it goes through).
md5sum=1504652f1b95802a9f2fe3725bf1336e

I was playing around a bit with the crush map and found the following:

1) Setting all re-weights to 1 does produce valid mappings. However, it will lead to large imbalances and is impractical in operations.

2) Doing something as simple/stupid as the following also results in valid mappings without having to change the weights:

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 50
        step set_choose_tries 200
        step take default
        step chooseleaf indep 3 type host
        step emit
        step take default
        step chooseleaf indep -3 type host
        step emit
}

and, as a second variant:

rule fs-data {
        id 1
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 50
        step set_choose_tries 200
        step take default
        step choose indep 3 type osd
        step emit
        step take default
        step choose indep -3 type osd
        step emit
}

Of course, now the current weights are probably unsuitable, as everything moves around. It probably also takes a lot more total tries to get rid of mappings with duplicate OSDs.

I probably have to read the code to understand how drawing straws from 8 different buckets with non-zero probabilities can lead to an infinite sequence of failed attempts at getting 6 different ones. There seems to be a hard-coded tunable that somehow turns seemingly infinite into finite.

The first modified rule will probably lead to better distribution of load, but bad distribution of data if a disk goes down (considering the tiny host and disk counts). The second rule seems to be almost as good or bad as the default one (step choose indep 0 type osd), except that it does produce valid mappings where the default rule fails.

I will hold off on changing the rule in the hope that you find a more elegant solution to this riddle.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 29 August 2022 19:13
To: Frank Schilder
Subject: Re: Bug in crush algorithm? 1 PG with the same OSD twice.

Hi Frank,

Could you share the osdmap so I can try to solve this riddle?

Cheers, Dan

On Mon, Aug 29, 2022, 17:26 Frank Schilder <frans@xxxxxx> wrote:

Hi Dan,

thanks for your answer. I'm not really convinced that we hit a corner case here, and even if it is one, it seems quite relevant for production clusters.

The usual way to get a valid mapping is to increase the number of tries. I increased the following max trial numbers, which I would expect to produce a mapping for all PGs:

# diff map-now.txt map-new.txt
4c4
< tunable choose_total_tries 50
---
> tunable choose_total_tries 250
93,94c93,94
<         step set_chooseleaf_tries 5
<         step set_choose_tries 100
---
>         step set_chooseleaf_tries 50
>         step set_choose_tries 200

When I test the map with crushtool it does not report bad mappings. Am I looking at the wrong tunables to increase? It should be possible to get valid mappings without having to modify the re-weights.

Thanks again for your help!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 29 August 2022 16:52:52
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Bug in crush algorithm? 1 PG with the same OSD twice.

Hi Frank,

CRUSH can only find 5 OSDs, given your current tree, rule, and reweights. This is why there is a NONE in the UP set for shard 6.
But in ACTING we see that it is refusing to remove shard 6 from osd.1 -- that is the only copy of that shard, so in this case it's helping you rather than deleting the shard altogether.

ACTING == what the OSDs are serving now.
UP == where CRUSH wants to place the shards.

I suspect that this is a case of CRUSH tunables + your reweights putting CRUSH in a corner case of not finding 6 OSDs for that particular PG. If you set the reweights all back to 1, it probably finds 6 OSDs?

Cheers, Dan

On Mon, Aug 29, 2022 at 4:44 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi all,
>
> I'm investigating a problem with a degenerated PG on an octopus 15.2.16 test cluster. It has 3 hosts x 3 OSDs and a 4+2 EC pool with failure domain OSD. After simulating a disk failure by removing an OSD and letting the cluster recover (all under load), I end up with a PG with the same OSD allocated twice:
>
> PG 4.1c, UP: [6,1,4,5,3,NONE] ACTING: [6,1,4,5,3,1]
>
> OSD 1 is allocated twice. How is this even possible?
>
> Here is the OSD tree:
>
> ID  CLASS  WEIGHT   TYPE NAME          STATUS     REWEIGHT  PRI-AFF
> -1         2.44798  root default
> -7         0.81599      host tceph-01
>  0    hdd  0.27199          osd.0             up   0.87999  1.00000
>  3    hdd  0.27199          osd.3             up   0.98000  1.00000
>  6    hdd  0.27199          osd.6             up   0.92999  1.00000
> -3         0.81599      host tceph-02
>  2    hdd  0.27199          osd.2             up   0.95999  1.00000
>  4    hdd  0.27199          osd.4             up   0.89999  1.00000
>  8    hdd  0.27199          osd.8             up   0.89999  1.00000
> -5         0.81599      host tceph-03
>  1    hdd  0.27199          osd.1             up   0.89999  1.00000
>  5    hdd  0.27199          osd.5             up   1.00000  1.00000
>  7    hdd  0.27199          osd.7      destroyed         0  1.00000
>
> I already tried to change some tunables, thinking about https://docs.ceph.com/en/octopus/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon, but giving up too soon is obviously not the problem. It is accepting a wrong mapping.
>
> Is there a way out of this? Clearly this is asking for trouble, if not data loss, and should not happen at all.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
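
For reference, the crushtool check mentioned in this thread can be reproduced roughly as follows. This is a minimal sketch, assuming the rule id 1 and the 6-shard (4+2 EC) pool discussed above; the file names are placeholders and not taken from the original messages.

# extract and decompile the current crush map (file names are examples)
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# edit crushmap.txt (tunables and/or the fs-data rule), then recompile
crushtool -c crushmap.txt -o crushmap-new.bin

# test the edited map offline: rule id 1, 6 shards per PG,
# and report PGs for which CRUSH does not return a complete mapping
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 6 --show-bad-mappings

# only if the test output looks sane, inject the new map into the cluster
ceph osd setcrushmap -i crushmap-new.bin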