Hello community,

We are facing an issue with the Ceph manager daemon after migrating from Reef 18.2.4 to Squid 19.2.0, and were wondering if anyone has already seen this or could point us at where to look further. Turning the balancer on (upmap mode) hangs our mgr completely most of the time and leaves the orchestrator as well as the dashboard/UI unresponsive. We first noticed this during the upgrade (as the mgr is the first daemon to be upgraded) and had to continue with the balancer turned off; post upgrade we still have to keep it off. The daemons run in Docker containers with cephadm as the orchestrator, and the health status is always HEALTH_OK.

A few observations:

- ceph balancer on -> the mgr log shows a handful of pg-upmap-items changes being applied
- the debug log shows the balancer run is done -> the mgr stops working; only pgmap debug messages keep appearing in the container log
- after some time (10-20 minutes) the mgr fails over to the standby node
- the new mgr starts and begins a full cluster inventory (services, disks, daemons, networks, etc.)
- the dashboard comes up and orch commands start working again
- the balancer kicks in, performs some upmaps, and the cycle repeats, failing over to the standby node again after quite some time
- while the mgr is not working, its TCP port still shows as listening (netstat) but does not even respond to telnet; when the mgr is working we can query it with curl

Cluster overview:

  cluster:
    id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
    health: HEALTH_OK

  services:
    mon:        5 daemons, quorum ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 3d)
    mgr:        ceph-node001.hgythj(active, since 24h), standbys: ceph-node002.jphtvg
    mds:        21/21 daemons up, 12 standby
    osd:        384 osds: 384 up (since 2d), 384 in (since 7d)
    rbd-mirror: 2 daemons active (2 hosts)
    rgw:        64 daemons active (32 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   16 pools, 12793 pgs
    objects: 751.20M objects, 1.4 PiB
    usage:   4.5 PiB used, 1.1 PiB / 5.6 PiB avail
    pgs:     12410 active+clean
             285   active+clean+scrubbing
             98    active+clean+scrubbing+deep

  io:
    client: 5.8 GiB/s rd, 169 MiB/s wr, 43.77k op/s rd, 9.92k op/s wr

We will be able to provide a more detailed log sequence, but for now we see these entries:

debug 2024-10-18T13:23:34.478+0000 7fad971ab640 0 [balancer DEBUG root] Waking up [active, now 2024-10-18_13:23:34]
debug 2024-10-18T13:23:34.478+0000 7fad971ab640 0 [balancer DEBUG root] Running
debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] Optimize plan auto_2024-10-18_13:23:34
debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] Mode upmap, max misplaced 0.050000
debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer DEBUG root] unknown 0.000000 degraded 0.000000 inactive 0.000000 misplaced 0
debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] do_upmap
debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] pools ['cephfs.our-awesome-pool1.meta', 'our-awesome-k8s.ceph-csi.rbd', 'cephfs.our-awesome-poo2l.data', 'cephfs.our-awesome-pool2.meta', 'europe-1.rgw.buckets.non-ec', 'europe-1.rgw.buckets.data', 'our-awesome-k8s2.ceph-csi.rbd', 'europe-1.rgw.log', '.mgr', 'our-awesome-k8s4.ceph-csi.rbd', 'europe-1.rgw.control', 'our-awesome-k8s3.ceph-csi.rbd', 'europe-1.rgw.meta', 'europe-1.rgw.buckets.index', '.rgw.root', 'cephfs.our-awesome-pool1.data']
debug 2024-10-18T13:23:34.814+0000 7fad971ab640 0 [balancer INFO root] prepared 10/10 upmap changes
debug 2024-10-18T13:23:34.814+0000 7fad971ab640 0 [balancer INFO root] Executing plan auto_2024-10-18_13:23:34
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.148 mappings [{'from': 113, 'to': 94}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.302 mappings [{'from': 138, 'to': 128}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.34e mappings [{'from': 92, 'to': 89}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.504 mappings [{'from': 156, 'to': 94}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.836 mappings [{'from': 148, 'to': 54}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.e4c mappings [{'from': 157, 'to': 54}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.e56 mappings [{'from': 147, 'to': 186}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.e63 mappings [{'from': 79, 'to': 31}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.ed9 mappings [{'from': 158, 'to': 237}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.f1e mappings [{'from': 153, 'to': 237}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer DEBUG root] commands [<mgr_module.CommandResult object at 0x7fae2e05b700>, <mgr_module.CommandResult object at 0x7fad2d889ca0>, <mgr_module.CommandResult object at 0x7fad2d810340>, <mgr_module.CommandResult object at 0x7fad2d88c5e0>, <mgr_module.CommandResult object at 0x7fad2d418130>, <mgr_module.CommandResult object at 0x7fad2d418ac0>, <mgr_module.CommandResult object at 0x7fae2f8d7cd0>, <mgr_module.CommandResult object at 0x7fad2d448520>, <mgr_module.CommandResult object at 0x7fad2d4480d0>, <mgr_module.CommandResult object at 0x7fad2d448fa0>]
162.55.93.25 - - [18/Oct/2024:13:23:35] "GET /metrics HTTP/1.1" 200 7731011 "" "Prometheus/2.43.0"
...
debug 2024-10-18T13:23:36.110+0000 7fad971ab640 0 [balancer DEBUG root] done
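For context, each of those plan entries corresponds to the ordinary pg-upmap-items command, so the first one above would look roughly like this if issued by hand (PG and OSD IDs copied from the log, purely for illustration):

    # remap one replica/shard of PG 44.148 from osd.113 to osd.94
    ceph osd pg-upmap-items 44.148 113 94

    # the resulting exceptions show up as pg_upmap_items entries in the OSD map
    ceph osd dump | grep pg_upmap_items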
We suspected this might be caused by older Ceph clients connecting and being (wrongly) identified as Luminous, interacting badly with the new upmap-read balancer mode (which I think came with Squid, but I might be wrong) and having to do something in the background even when disabled. However, setting set-require-min-compat-client did not help our case, so we dismissed that assumption.

We would be beyond happy for any advice on where to look further, as running without the balancer is sad. If anyone would like to go through detailed logs, we are glad to provide them.

Best,
Laimis J.
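P.S. For reference, the client-compatibility check we mention above boils down to commands along these lines (the release value here is only an example, not a recommendation):

    # list the releases/feature bits reported by currently connected clients
    ceph features

    # raise the minimum client release the cluster will accept
    ceph osd set-require-min-compat-client reef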