Re: unknown PGs after adding hosts in different subtree

Frank Schilder <frans@xxxxxx> · Fri, 24 May 2024 14:24:33 +0000

Hi Eugen,

so it is partly "unexpectedly expected" and partly buggy. I really wish the crush implementation honoured a few obvious invariants. It is extremely counter-intuitive that mappings taken from a sub-set change even if both the sub-set and the mapping instructions themselves don't.

> - Use different root names

That's what we are doing and it works like a charm, also for draining OSDs.

> more specific crush rules.

I guess you mean using something like "step take DCA class hdd" instead of "step take default class hdd", as in:

rule rule-ec-k7m11 {
        id 1
        type erasure
        min_size 3
        max_size 18
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take DCA class hdd
        step chooseleaf indep 9 type host
        step emit
        step take DCB class hdd
        step chooseleaf indep 9 type host
        step emit
}

According to the documentation, this should actually work and be almost equivalent to your crush rule. The difference here is that it will make sure that the first 9 shards are from DCA and the second 9 shards from DCB (it's an ordering). A side effect is that all primary OSDs will be in DCA if both DCs are up. I remember people asking for that as a feature in multi-DC set-ups, so that the DC with the lowest latency holds the primary OSDs by default.
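
To illustrate the ordering described above, here is a toy Python model. It is not the CRUSH algorithm: `pick_hosts` is a hypothetical stand-in for "chooseleaf indep 9 type host" that merely picks distinct hosts deterministically, and the host names are invented. The only thing it is meant to show is that, assuming each take block's picks are appended to the result in rule order, shards 0-8 land in DCA, shards 9-17 in DCB, and the first shard (the primary) is in DCA.

```python
import hashlib


def pick_hosts(hosts, count, pg_seed):
    """Toy stand-in for 'chooseleaf indep N type host' (NOT real CRUSH).

    Ranks hosts by a hash of (pg_seed, host) and returns the first
    `count`, so the pick is deterministic and the hosts are distinct.
    """
    ranked = sorted(
        hosts,
        key=lambda h: hashlib.sha256(f"{pg_seed}:{h}".encode()).hexdigest(),
    )
    return ranked[:count]


def map_pg(pg_seed, dca_hosts, dcb_hosts, per_dc=9):
    """Concatenate the picks of two take blocks in rule order."""
    first_block = pick_hosts(dca_hosts, per_dc, pg_seed)   # take DCA
    second_block = pick_hosts(dcb_hosts, per_dc, pg_seed)  # take DCB
    return first_block + second_block


# Invented host names, ten per data centre.
dca = [f"dca-host{i}" for i in range(1, 11)]
dcb = [f"dcb-host{i}" for i in range(1, 11)]

mapping = map_pg(42, dca, dcb)
primary = mapping[0]  # the first shard acts as primary

print("shards 0-8  :", mapping[:9])
print("shards 9-17 :", mapping[9:])
print("primary host:", primary)
```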

Can you give this crush rule a try and report back whether or not the behaviour when adding hosts changes?

In case you have time, it would be great if you could collect information on (reproducing) the fatal peering problem. While remappings might be "unexpectedly expected", it is clearly a serious bug that incomplete and unknown PGs show up in the process of adding hosts at the root.
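
When collecting that information, a small helper along the following lines might be useful. It is only a sketch: it assumes you already have a list of PG records with the fields `pgid`, `state`, `up` and `acting` (for example extracted from the JSON output of `ceph pg ls`, whose exact layout may differ between releases). The function name and the sample records are made up for illustration. It counts state flags, lists PGs that are unknown or incomplete, and lists remapped PGs that have a shard on one of the newly added OSDs.

```python
from collections import Counter


def summarize_pgs(pg_stats, new_osds):
    """Summarise PG records (hypothetical helper, not part of Ceph).

    pg_stats: iterable of dicts with 'pgid', 'state', 'up', 'acting'.
    new_osds: set of OSD ids that were just added.
    Returns (flag counts, unknown/incomplete pgids,
             remapped pgids touching a new OSD).
    """
    states = Counter()
    problem = []
    on_new = []
    for pg in pg_stats:
        # A state string looks like "active+remapped+backfilling".
        flags = set(pg["state"].split("+"))
        states.update(flags)
        if flags & {"unknown", "incomplete"}:
            problem.append(pg["pgid"])
        members = set(pg["up"]) | set(pg["acting"])
        if "remapped" in flags and members & new_osds:
            on_new.append(pg["pgid"])
    return states, problem, on_new


# Invented sample records, for illustration only.
sample = [
    {"pgid": "2.0", "state": "active+clean",
     "up": [7, 10, 3], "acting": [7, 10, 3]},
    {"pgid": "2.1", "state": "active+remapped+backfilling",
     "up": [7, 14, 3], "acting": [7, 10, 3]},
    {"pgid": "2.2", "state": "unknown", "up": [], "acting": []},
    {"pgid": "2.3", "state": "active+clean+remapped",
     "up": [10, 3, 7], "acting": [7, 10, 3]},
]

states, problem, on_new = summarize_pgs(sample, new_osds={14, 15})
print("state flags          :", dict(states))
print("unknown/incomplete   :", problem)
print("remapped onto new OSD:", on_new)
```

A remapped PG that does not appear in the last list (such as 2.3 in the sample) is only being shuffled among the old OSDs.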

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@xxxxxxx
Subject:  Re: unknown PGs after adding hosts in different subtree

I'm starting to think that the root cause of the remapping is just the
fact that the crush rule(s) contain(s) the "step take default" line:

          step take default class hdd

My interpretation is that crush simply tries to honor the rule:
consider everything underneath the "default" root, so PGs get remapped
if new hosts are added there (but not in their designated subtree
buckets). The effect (unknown PGs) is bad, but there are a couple of
options to avoid that:

- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything
is where it's supposed to be, then reweight the OSDs.
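
The last option, and the follow-up move of a host into its final bucket, can be sketched as a sequence of CLI calls. The Python below only builds and prints the command lines, it does not run anything. The function names are hypothetical; the OSD ids and weights are taken from the osd tree further down in this thread, and the destination `root=default room=room1` for host5 is an assumption for illustration.

```python
def initial_weight_zero_plan(osd_weights):
    """Commands for: new OSDs start at crush weight 0, reweight later.

    osd_weights: dict mapping OSD id -> final crush weight.
    """
    plan = [["ceph", "config", "set", "osd",
             "osd_crush_initial_weight", "0"]]
    # ... hosts and OSDs are added and moved into place here ...
    for osd_id, weight in sorted(osd_weights.items()):
        plan.append(["ceph", "osd", "crush", "reweight",
                     f"osd.{osd_id}", f"{weight:.5f}"])
    return plan


def move_host_plan(host, **location):
    """Command to move a host bucket to its designated crush location."""
    loc_args = [f"{bucket}={name}" for bucket, name in location.items()]
    return ["ceph", "osd", "crush", "move", host] + loc_args


reweight_cmds = initial_weight_zero_plan({1: 0.02339, 13: 0.02339})
move_cmd = move_host_plan("host5", root="default", room="room1")

for cmd in reweight_cmds + [move_cmd]:
    print(" ".join(cmd))
```

Review the printed commands before running any of them on a real cluster.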

Quoting Eugen Block <eblock@xxxxxx>:

> Hi Frank,
>
> thanks for looking up those trackers. I haven't looked into them
> yet, I'll read your response in detail later, but I wanted to add
> some new observation:
>
> I added another root bucket (custom) to the osd tree:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
> -12               0  root custom
>  -1         0.27698  root default
>  -8         0.09399      room room1
>  -3         0.04700          host host1
>   7    hdd  0.02299              osd.7       up   1.00000  1.00000
>  10    hdd  0.02299              osd.10      up   1.00000  1.00000
> ...
>
> Then I tried this approach to add a new host directly to the
> non-default root:
>
> # cat host5.yaml
> service_type: host
> hostname: host5
> addr: 192.168.168.54
> location:
>   root: custom
> labels:
>    - osd
>
> # ceph orch apply -i host5.yaml
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
> -12         0.04678  root custom
> -23         0.04678      host host5
>   1    hdd  0.02339          osd.1           up   1.00000  1.00000
>  13    hdd  0.02339          osd.13          up   1.00000  1.00000
>  -1         0.27698  root default
>  -8         0.09399      room room1
>  -3         0.04700          host host1
>   7    hdd  0.02299              osd.7       up   1.00000  1.00000
>  10    hdd  0.02299              osd.10      up   1.00000  1.00000
> ...
>
> host5 is placed directly underneath the new custom root correctly,
> and not a single PG is marked "remapped"! So this is actually what I
> (or we) expected. I'm not sure yet what to make of it, but I'm
> leaning towards using this approach in the future: add hosts
> underneath a different root first, and then move them to their
> designated location.
>
> Just to validate again, I added host6 without a location spec, so
> it's placed underneath the default root again:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
> -12         0.04678  root custom
> -23         0.04678      host host5
>   1    hdd  0.02339          osd.1           up   1.00000  1.00000
>  13    hdd  0.02339          osd.13          up   1.00000  1.00000
>  -1         0.32376  root default
> -25         0.04678      host host6
>  14    hdd  0.02339          osd.14          up   1.00000  1.00000
>  15    hdd  0.02339          osd.15          up   1.00000  1.00000
>  -8         0.09399      room room1
>  -3         0.04700          host host1
> ...
>
> And this leads to remapped PGs again. I assume this must be related
> to the default root. I'm gonna investigate further.
>
> Thanks!
> Eugen
>
>
> Quoting Frank Schilder <frans@xxxxxx>:
>
>> Hi Eugen,
>>
>> just to add another strange observation from long ago:
>> https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't
>> see any reweights in your trees, so it's something else. However,
>> there seem to be multiple issues with EC pools and peering.
>>
>> I also want to clarify:
>>
>>> If this is the case, it is possible that this is partly
>>> intentional and partly buggy.
>>
>> "Partly intentional" here means the code behaviour changes when you
>> add OSDs to the root outside the rooms and this change is not
>> considered a bug. It is clearly *not* expected as it means you
>> cannot do maintenance on a pool living on a tree A without
>> affecting pools on the same device class living on an unmodified
>> subtree of A.
>>
>> From a ceph user's point of view everything you observe looks
>> buggy. I would really like to see a good explanation of why the
>> mappings in the subtree *should* change when OSDs are added above
>> that subtree, as in your case, when the expectation, for good
>> reasons, is that they don't. This would help in devising clean
>> procedures for adding hosts when you (and I) want to add OSDs first
>> without any peering and then move the OSDs into place, so that
>> peering happens separately from adding instead of everything
>> happening in parallel as a total mess.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Frank Schilder <frans@xxxxxx>
>> Sent: Thursday, May 23, 2024 6:32 PM
>> To: Eugen Block
>> Cc: ceph-users@xxxxxxx
>> Subject:  Re: unknown PGs after adding hosts in
>> different subtree
>>
>> Hi Eugen,
>>
>> I'm at home now. Could you please check that none of the remapped
>> PGs have shards on the new OSDs, i.e. that it's just shuffling
>> around mappings within the same set of OSDs under rooms?
>>
>> If this is the case, it is possible that this is partly intentional
>> and partly buggy. The remapping is then probably intentional and
>> the method I use with a disjoint tree for new hosts prevents such
>> remappings initially (the crush code sees the new OSDs in the root,
>> doesn't use them but their presence does change choice orders
>> resulting in remapped PGs). However, the unknown PGs should clearly
>> not occur.
>>
>> I'm afraid that the peering code has quite a few bugs, I reported
>> something at least similarly weird a long time ago:
>> https://tracker.ceph.com/issues/56995 and
>> https://tracker.ceph.com/issues/46847. Might even be related. It
>> looks like peering can lose track of PG members in certain
>> situations (specifically after adding OSDs, until rebalancing
>> has completed). In my cases, I get degraded objects even though
>> everything is obviously still around. Flipping between the
>> crush-maps before/after the change re-discovers everything again.
>>
>> Issue 46847 is long-standing and still unresolved. In case you need
>> to file a tracker, please consider referring to the two above as
>> "might be related" if you deem that they are.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx