Hi,
This sounds a bit like a customer issue we had almost two years ago.
Basically, it was about mon_max_pg_per_osd (default 250), which was
exceeded on the first OSD to activate (and on the last remaining OSD
when stopping them). You can read all the details in the lengthy thread [1].
But if this was the actual issue, you should probably see something
like this in the logs:
  2022-04-06 14:24:55.256 7f8bb5a0e700 1 osd.8 43377 maybe_wait_for_max_pg withhold creation of pg 75.56s16: 750 >= 750
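If your OSDs log to files, a grep like the following should find it
(the log path below is just the usual default, adjust to your setup;
on containerized/cephadm deployments you can go through cephadm logs
instead):

  # file-based logging (default path assumed)
  grep -H maybe_wait_for_max_pg /var/log/ceph/ceph-osd.*.log

  # cephadm deployments, per daemon (osd.8 is just an example)
  cephadm logs --name osd.8 | grep maybe_wait_for_max_pg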
In our case we did the opposite and removed an entire host. I'll just
quote Josh's explanation from the mentioned thread:
1. All OSDs on the host are purged per above.
2. New OSDs are created.
3. As they come up, one by one, CRUSH starts to assign PGs to them.
Importantly, when the first OSD comes up, it gets a large number of
PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't
activate.
4. As each of the remaining OSDs comes up, CRUSH re-assigns some PGs to it.
5. Finally, all OSDs are up. However, any PGs that were stuck in
"activating" from step 3 that were _not_ reassigned to other OSDs are
still stuck in "activating", and need a repeer or OSD down/up cycle to
restart peering for them. (At least in Pacific, tweaking
mon_max_pg_per_osd also allows some of these PGs to make peering
progress.)
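If you do end up with PGs stuck in "activating", kicking peering for
them again would look roughly like this (just a sketch, adjust the PG
ID to whatever the listing shows you):

  # list PGs currently stuck in the activating state
  ceph pg ls activating

  # restart peering for one of them, e.g. pg 75.56 from the log line above
  ceph pg repeer 75.56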
Note that during backfill/recovery the effective limit is 750
(mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio, i.e. 250 * 3 =
750). As a workaround we increased osd_max_pg_per_osd_hard_ratio to 5
and the issue was never seen again.
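For reference, checking the current values and applying that kind of
workaround via the config database would look something like this
(a sketch; how you manage your config may differ):

  # check the current limits
  ceph config get osd mon_max_pg_per_osd
  ceph config get osd osd_max_pg_per_osd_hard_ratio

  # workaround: raise the hard ratio, so 250 * 5 = 1250 PGs per OSD are
  # tolerated during backfill/recovery
  ceph config set osd osd_max_pg_per_osd_hard_ratio 5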
Can you check the logs for that message?
Regards,
Eugen
[1] https://www.spinics.net/lists/ceph-users/msg71933.html
Quoting Ruben Vestergaard <rubenv@xxxxxxxx>:
Hi
We have a cluster which currently looks like this:
  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 13d)
    mgr: jolly.tpgixt(active, since 25h), standbys: dopey.lxajvk, lazy.xuhetq
    mds: 1/1 daemons up, 2 standby
    osd: 449 osds: 425 up (since 15m), 425 in (since 5m); 5104 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 11153 pgs
    objects: 304.11M objects, 988 TiB
    usage:   1.6 PiB used, 1.4 PiB / 2.9 PiB avail
    pgs:     6/1617270006 objects degraded (0.000%)
             366696947/1617270006 objects misplaced (22.674%)
             6043 active+clean
             5041 active+remapped+backfill_wait
             66   active+remapped+backfilling
             2    active+recovery_wait+degraded+remapped
             1    active+recovering+degraded
It's currently rebalancing after adding a node, but the rebalance
has been rather slow -- right now it's running 66 backfills, though it
seems to stabilize around 8 backfills eventually. We figured that
perhaps adding another node might speed things up.
Immediately upon adding the node, we get slow ops and inactive PGs.
Removing the new node gets us back in working order.
It turns out that even adding a single OSD breaks the cluster and
immediately puts it in this state:
    [WRN] PG_DEGRADED: Degraded data redundancy: 6/1617265712 objects degraded (0.000%), 3 pgs degraded
        pg 37.c8 is active+recovery_wait+degraded+remapped, acting [410,163,236,209,7,283,155,143,78]
        pg 37.1a1 is active+recovering+degraded, acting [234,424,163,74,22,128,177,153,181]
        pg 37.1da is active+recovery_wait+degraded+remapped, acting [163,408,230,190,93,284,50,78,44]
    [WRN] SLOW_OPS: 22 slow ops, oldest one blocked for 54 sec, daemons [osd.11,osd.110,osd.112,osd.117,osd.120,osd.123,osd.13,osd.136,osd.144,osd.157]... have slow ops.
The newly added OSD was osd.431, which is not among the daemons
reporting slow ops, so it does not appear to be their immediate cause;
however, removing osd.431 immediately clears the problem.
We thought we might be experiencing 'Crush giving up too soon'
symptoms [1], as we have seen similar behaviour on another pool, but
it does not appear to be the case here. We went through the motions
described on the page and everything looked OK.
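(For reference, the check described on that page boils down to roughly
the following; the rule id and num-rep are placeholders and need to
match the pool's crush rule and k+m:)

  # export and decompile the current crushmap
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # look for mappings CRUSH cannot fill for the pool's rule
  crushtool -i crushmap.bin --test --show-bad-mappings --rule 1 --num-rep 6 --min-x 1 --max-x 10000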
At least one of the pools that stops working is a 4+2 EC pool, placed
on spinning rust, some 200-ish disks distributed across 13 nodes. I'm
not sure whether other pools break as well, but that particular 4+2 EC
pool is rather important, so I'm a little wary of experimenting blindly.
Any thoughts on where to look next?
Thanks,
Ruben Vestergaard
[1]
https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx