Hi,
after more testing and investigating I created a tracker issue:
https://tracker.ceph.com/issues/66310
My current theory for the inactive/unknown PGs is that the MGR gets
overloaded with mon messages. The default for mgr_mon_messages is only
128; the mgr service has taken on more roles over the years, but the
defaults haven't been adjusted.
Although I couldn't reproduce actual unknown PGs in my lab, I still
see quite high get_or_fail_fail counts, including bursts when I add
nodes to the lab cluster.
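For reference, this is roughly how I've been watching those counters.
It's just a sketch: the mgr daemon name "x" is a placeholder (run it
locally on the host of the active mgr), and the exact throttle section
names in the perf dump can differ between releases:

# dump only the throttle sections of the mgr's perf counters
ceph daemon mgr.x perf dump | \
  jq 'with_entries(select(.key | startswith("throttle")))
      | map_values({max, val, get_or_fail_fail, get_or_fail_success})'

The get_or_fail_fail values in that output are what I mean above.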
An overloaded mgr would also explain why we have to fail it so often.
It feels like failing the mgr has become the first suggestion for
almost every mgr-related issue reported on this list.
I don't know what values would make sense, or if mgr_mon_bytes should
be increased as well.
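In case anyone wants to experiment, bumping the throttles is just a
config change. This is only a sketch with arbitrary values, not a
recommendation, and I'm assuming the mgr is the daemon consuming these
options; I haven't verified whether they apply at runtime or need a
mgr restart/failover:

# arbitrary example values (mgr_mon_messages defaults to 128)
ceph config set mgr mgr_mon_messages 1024
ceph config set mgr mgr_mon_bytes 268435456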
Regards,
Eugen
Quoting Eugen Block <eblock@xxxxxx>:
Hi,
I guess you mean use something like "step take DCA class hdd"
instead of "step take default class hdd" as in:
rule rule-ec-k7m11 {
    id 1
    type erasure
    min_size 3
    max_size 18
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take DCA class hdd
    step chooseleaf indep 9 type host
    step take DCB class hdd
    step chooseleaf indep 9 type host
    step emit
}
Almost, yes. There needs to be an "emit" step after the first
chooseleaf, so something like this:
step take DCA class hdd
step chooseleaf indep 9 type host
step emit
step take DCB class hdd
step chooseleaf indep 9 type host
step emit
Otherwise the placement according to crushtool would be incomplete
and only 9 chunks would get a mapping. With this rule (omitting
"default") no bad mappings are reported, so that would most likely
work as well. But having all primaries in one DC is not optimal,
although for this specific customer it probably wouldn't make a
difference. In general, though, I agree: not ideal.
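For anyone who wants to redo the crushtool check, this is roughly what
it looks like (file names are examples, rule id 1 as in the rule
above):

# grab and decompile the current crush map, edit the rule, recompile
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
#   ... edit rule-ec-k7m11 in crushmap.txt ...
crushtool -c crushmap.txt -o crushmap-new.bin

# without the first "emit" only 9 of the 18 chunks get a mapping,
# with it no bad mappings are reported
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 18 --show-mappings
crushtool -i crushmap-new.bin --test --rule 1 --num-rep 18 --show-bad-mappings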
In case you have time, it would be great if you could collect
information on (reproducing) the fatal peering problem. While
remappings might be "unexpectedly expected" it is clearly a serious
bug that incomplete and unknown PGs show up in the process of
adding hosts at the root.
Time wouldn't be an issue, but there's no way for me to do that on
the customer's cluster. My lab doesn't behave as observed, which isn't
surprising with little data and no client load. I'm not sure yet how
to reproduce it there.
Thanks,
Eugen
Quoting Frank Schilder <frans@xxxxxx>:
Hi Eugen,
so it is partly "unexpectedly expected" and partly buggy. I really
wish the crush implementation honoured a few obvious invariants. It
is extremely counter-intuitive that mappings taken from a sub-set
change even if both the sub-set and the mapping instructions
themselves don't.
- Use different root names
That's what we are doing and it works like a charm, also for draining OSDs.
more specific crush rules.
I guess you mean use something like "step take DCA class hdd"
instead of "step take default class hdd" as in:
rule rule-ec-k7m11 {
    id 1
    type erasure
    min_size 3
    max_size 18
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take DCA class hdd
    step chooseleaf indep 9 type host
    step take DCB class hdd
    step chooseleaf indep 9 type host
    step emit
}
According to the documentation, this should actually work and be
almost equivalent to your crush rule. The difference here is that it
will make sure that the first 9 shards come from DCA and the second 9
shards from DCB (it's an ordering). A side effect is that all primary
OSDs will be in DCA if both DCs are up. I remember people asking for
that as a feature in multi-DC set-ups, to pick the DC with the lowest
latency and have the primary OSDs there by default.
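If you want to verify where the primaries end up, something along
these lines should do (the pool name is just an example, and the JSON
layout of "pg ls" may differ slightly between releases):

# which OSD ids live under each DC bucket
ceph osd ls-tree DCA
ceph osd ls-tree DCB

# acting primary of every PG in the pool
ceph pg ls-by-pool ec-pool -f json | \
  jq -r '.pg_stats[] | [.pgid, .acting_primary] | @tsv'

If the rule works as described, all the listed primaries should be
OSDs from the DCA list.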
Can you give this crush rule a try and report back whether or not
the behaviour when adding hosts changes?
In case you have time, it would be great if you could collect
information on (reproducing) the fatal peering problem. While
remappings might be "unexpectedly expected" it is clearly a serious
bug that incomplete and unknown PGs show up in the process of
adding hosts at the root.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@xxxxxxx
Subject: Re: unknown PGs after adding hosts in different subtree
I'm starting to think that the root cause of the remapping is simply
the fact that the crush rule(s) contain the "step take default" line:
step take default class hdd
My interpretation is that crush simply tries to honor the rule:
consider everything underneath the "default" root, so PGs get remapped
when new hosts are added there (even though they're not yet in their
designated subtree buckets). The effect (unknown PGs) is bad, but
there are a couple of options to avoid it:
- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything
is where it's supposed to be, then reweight the OSDs (a sketch follows
below).
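Here's what I mean for the last option (the reweight value is just an
example from my lab OSDs):

# new OSDs get crush weight 0, so nothing remaps when they are created
ceph config set osd osd_crush_initial_weight 0
#   ... add the hosts/OSDs and move the buckets to their final location ...
# then weight the OSDs in, one by one or all at once
ceph osd crush reweight osd.14 0.02339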
Quoting Eugen Block <eblock@xxxxxx>:
Hi Frank,
thanks for looking up those trackers. I haven't looked into them
yet, I'll read your response in detail later, but I wanted to add
a new observation:
I added another root bucket (custom) to the osd tree:
# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-12         0        root custom
 -1         0.27698  root default
 -8         0.09399      room room1
 -3         0.04700          host host1
  7    hdd  0.02299              osd.7       up   1.00000  1.00000
 10    hdd  0.02299              osd.10      up   1.00000  1.00000
...
Then I tried this approach to add a new host directly to the
non-default root:
# cat host5.yaml
service_type: host
hostname: host5
addr: 192.168.168.54
location:
  root: custom
labels:
  - osd
# ceph orch apply -i host5.yaml
# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-12         0.04678  root custom
-23         0.04678      host host5
  1    hdd  0.02339          osd.1           up   1.00000  1.00000
 13    hdd  0.02339          osd.13          up   1.00000  1.00000
 -1         0.27698  root default
 -8         0.09399      room room1
 -3         0.04700          host host1
  7    hdd  0.02299              osd.7       up   1.00000  1.00000
 10    hdd  0.02299              osd.10      up   1.00000  1.00000
...
host5 is placed directly underneath the new custom root, as intended,
and not a single PG is marked "remapped"! So this is actually what I
(or we) expected. I'm not sure yet what to make of it, but I'm
leaning towards using this approach in the future: add hosts
underneath a different root first and then move them to their
designated location.
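For completeness, moving the host bucket into place afterwards is a
single crush move (bucket names taken from my lab example; the target
location is whatever the designated subtree is):

# move host5 from the interim "custom" root into its final location
ceph osd crush move host5 root=default room=room1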
Just to validate again, I added host6 without a location spec, so
it's placed underneath the default root again:
# ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-12         0.04678  root custom
-23         0.04678      host host5
  1    hdd  0.02339          osd.1           up   1.00000  1.00000
 13    hdd  0.02339          osd.13          up   1.00000  1.00000
 -1         0.32376  root default
-25         0.04678      host host6
 14    hdd  0.02339          osd.14          up   1.00000  1.00000
 15    hdd  0.02339          osd.15          up   1.00000  1.00000
 -8         0.09399      room room1
 -3         0.04700          host host1
...
And this leads to remapped PGs again. I assume this must be related
to the default root. I'm going to investigate further.
Thanks!
Eugen
Quoting Frank Schilder <frans@xxxxxx>:
Hi Eugen,
just to add another strange observation from long ago:
https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't
see any reweights in your trees, so it's something else. However,
there seem to be multiple issues with EC pools and peering.
I also want to clarify:
If this is the case, it is possible that this is partly
intentional and partly buggy.
"Partly intentional" here means the code behaviour changes when you
add OSDs to the root outside the rooms and this change is not
considered a bug. It is clearly *not* expected as it means you
cannot do maintenance on a pool living on a tree A without
affecting pools on the same device class living on an unmodified
subtree of A.
From a Ceph user's point of view, everything you observe looks
buggy. I would really like to see a good explanation why the
mappings in the subtree *should* change when adding OSDs above that
subtree, as in your case, when the expectation, for good reasons, is
that they don't. This would help in devising clean procedures for
adding hosts when you (and I) want to add OSDs first without any
peering and then move the OSDs into place, so that the peering
happens separately from the adding and not as a total mess with
everything in parallel.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Thursday, May 23, 2024 6:32 PM
To: Eugen Block
Cc: ceph-users@xxxxxxx
Subject: Re: unknown PGs after adding hosts in different subtree
Hi Eugen,
I'm at home now. Could you please check that none of the remapped PGs
have shards on the new OSDs, i.e. that it's just shuffling mappings
around within the same set of OSDs under the rooms?
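I mean something along these lines (just a sketch; the JSON layout of
"pg ls" may differ between releases, and the new OSD ids are whatever
you just added):

# dump up and acting sets of all remapped PGs
ceph pg ls remapped -f json | \
  jq -r '.pg_stats[] | [.pgid, (.up|join(",")), (.acting|join(","))] | @tsv'

Then check whether any of the new OSD ids show up in those sets.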
If this is the case, it is possible that this is partly intentional
and partly buggy. The remapping is then probably intentional, and the
method I use with a disjoint tree for new hosts prevents such
remappings initially (the crush code sees the new OSDs in the root
and doesn't use them, but their presence does change choice orders,
resulting in remapped PGs). However, the unknown PGs should clearly
not occur.
I'm afraid the peering code has quite a few bugs; I reported
something at least similarly weird a long time ago:
https://tracker.ceph.com/issues/56995 and
https://tracker.ceph.com/issues/46847. They might even be related. It
looks like peering can lose track of PG members in certain situations
(specifically, after adding OSDs until rebalancing has completed). In
my cases, I get degraded objects even though everything is obviously
still around. Flipping between the crush maps before/after the change
re-discovers everything again. Issue 46847 is long-standing and still
unresolved. In case you need to file a tracker, please consider
referring to the two above as "might be related" if you deem them
relevant.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx