Hi Laimis,

Thanks for reporting. Can you please raise a tracker ticket and attach the mgr and mon logs? Can you bump up the logging level in the balancer module with `ceph config set mgr mgr/balancer/ debug` and the mon logs with `ceph config set mon.* debug_mon 20`?

Thanks,
Laura

On Fri, Oct 18, 2024 at 10:53 AM <laimis.juzeliunas@xxxxxxxxxx> wrote:

> Hello community,
>
> We are facing an issue with the Ceph manager daemon after migrating from
> Reef 18.2.4 to Squid 19.2.0, and we were wondering if anyone has already
> faced this or could guide us on where to look further. When we turn on the
> balancer (upmap mode), it hangs our mgr completely most of the time and
> leaves the orchestrator as well as the dashboard/UI unresponsive. We
> noticed this initially during the upgrade (as the mgr is the first in
> line) and had to continue with the balancer turned off. Post upgrade we
> still need to keep it turned off. We run Docker daemons with cephadm as
> the orchestrator; health status is always HEALTH_OK.
>
> A few observations:
> ceph balancer on -> mgr logs show some pg_upmaps being performed
> debug logs show that the balancer is done -> mgr stops working
> only pgmap debug logs remain on the container
> after some time (10-20 minutes) the mgr fails over to the standby node
> the mgr starts and begins a full cluster inventory (services, disks, daemons, networks, etc.)
> the dashboard starts up, orch commands begin working
> the balancer kicks in, performs some upmaps, and the cycle continues, failing over to the standby node after quite some time
> while the mgr is not working, the TCP port shows as being listened on (netstat) but does not even respond to telnet; when the mgr is working we can query it with curl
>
> Cluster overview:
>   cluster:
>     id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 3d)
>     mgr: ceph-node001.hgythj(active, since 24h), standbys: ceph-node002.jphtvg
>     mds: 21/21 daemons up, 12 standby
>     osd: 384 osds: 384 up (since 2d), 384 in (since 7d)
>     rbd-mirror: 2 daemons active (2 hosts)
>     rgw: 64 daemons active (32 hosts, 1 zones)
>
>   data:
>     volumes: 2/2 healthy
>     pools:   16 pools, 12793 pgs
>     objects: 751.20M objects, 1.4 PiB
>     usage:   4.5 PiB used, 1.1 PiB / 5.6 PiB avail
>     pgs:     12410 active+clean
>              285   active+clean+scrubbing
>              98    active+clean+scrubbing+deep
>
>   io:
>     client: 5.8 GiB/s rd, 169 MiB/s wr, 43.77k op/s rd, 9.92k op/s wr
>
> We will be able to provide a more detailed log sequence, but for now we
> see these entries:
>
> debug 2024-10-18T13:23:34.478+0000 7fad971ab640 0 [balancer DEBUG root] Waking up [active, now 2024-10-18_13:23:34]
> debug 2024-10-18T13:23:34.478+0000 7fad971ab640 0 [balancer DEBUG root] Running
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] Optimize plan auto_2024-10-18_13:23:34
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] Mode upmap, max misplaced 0.050000
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer DEBUG root] unknown 0.000000 degraded 0.000000 inactive 0.000000 misplaced 0
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] do_upmap
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640 0 [balancer INFO root] pools ['cephfs.our-awesome-pool1.meta', 'our-awesome-k8s.ceph-csi.rbd', 'cephfs.our-awesome-poo2l.data', 'cephfs.our-awesome-pool2.meta', 'europe-1.rgw.buckets.non-ec', 'europe-1.rgw.buckets.data', 'our-awesome-k8s2.ceph-csi.rbd', 'europe-1.rgw.log',
> '.mgr', 'our-awesome-k8s4.ceph-csi.rbd', 'europe-1.rgw.control', 'our-awesome-k8s3.ceph-csi.rbd', 'europe-1.rgw.meta', 'europe-1.rgw.buckets.index', '.rgw.root', 'cephfs.our-awesome-pool1.data']
> debug 2024-10-18T13:23:34.814+0000 7fad971ab640 0 [balancer INFO root] prepared 10/10 upmap changes
> debug 2024-10-18T13:23:34.814+0000 7fad971ab640 0 [balancer INFO root] Executing plan auto_2024-10-18_13:23:34
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.148 mappings [{'from': 113, 'to': 94}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.302 mappings [{'from': 138, 'to': 128}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.34e mappings [{'from': 92, 'to': 89}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.504 mappings [{'from': 156, 'to': 94}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.836 mappings [{'from': 148, 'to': 54}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.e4c mappings [{'from': 157, 'to': 54}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.e56 mappings [{'from': 147, 'to': 186}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.e63 mappings [{'from': 79, 'to': 31}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.ed9 mappings [{'from': 158, 'to': 237}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer INFO root] ceph osd pg-upmap-items 44.f1e mappings [{'from': 153, 'to': 237}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640 0 [balancer DEBUG root] commands [<mgr_module.CommandResult object at 0x7fae2e05b700>, <mgr_module.CommandResult object at 0x7fad2d889ca0>, <mgr_module.CommandResult object at 0x7fad2d810340>, <mgr_module.CommandResult object at 0x7fad2d88c5e0>, <mgr_module.CommandResult object at 0x7fad2d418130>, <mgr_module.CommandResult object at 0x7fad2d418ac0>, <mgr_module.CommandResult object at 0x7fae2f8d7cd0>, <mgr_module.CommandResult object at 0x7fad2d448520>, <mgr_module.CommandResult object at 0x7fad2d4480d0>, <mgr_module.CommandResult object at 0x7fad2d448fa0>]
> 162.55.93.25 - - [18/Oct/2024:13:23:35] "GET /metrics HTTP/1.1" 200 7731011 "" "Prometheus/2.43.0"
> ...
> debug 2024-10-18T13:23:36.110+0000 7fad971ab640 0 [balancer DEBUG root] done
>
> We suspected that this might be caused by older ceph clients connecting
> and being identified (wrongly) as Luminous by Ceph, with the new
> upmap-read balancer mode (I think it came with Squid, but I might be
> wrong) having to do something in the background even when disabled.
> However, setting set-require-min-compat-client did not help in our case,
> so we dismissed this assumption.
>
> We would be beyond happy for any advice on where to look further, as
> having no balancer is sad.
> If anyone would like to go through the detailed logs, we are glad to
> provide them.
>
> Best,
> Laimis J.
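For reference, the log-level bump requested above could look roughly like the following. This is only a sketch, assuming the generic mgr/<module>/log_level module option and the plain "mon" config section; double-check the exact option names against your 19.2.0 build before running it:

    # raise balancer module logging on the mgr (assumed option path)
    ceph config set mgr mgr/balancer/log_level debug

    # raise monitor debug logging on all mons
    ceph config set mon debug_mon 20

    # reproduce the hang (e.g. "ceph balancer on"), save the active mgr
    # and mon logs for the tracker ticket, then revert to the defaults
    ceph config rm mgr mgr/balancer/log_level
    ceph config rm mon debug_mon

debug_mon 20 is very chatty, so it is worth reverting it as soon as logs covering one hang have been captured.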
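On the client-compatibility suspicion mentioned in the report, the negotiated client releases can be inspected read-only before changing anything; a minimal sketch using standard commands:

    # show the currently required minimum client release
    ceph osd get-require-min-compat-client

    # list connected clients/daemons and the release they are negotiated as
    ceph features

If older clients really are being treated as Luminous, they should show up grouped under that release in the ceph features output.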
--
Laura Flores
She/Her/Hers
Software Engineer, Ceph Storage <https://ceph.io>
Chicago, IL
lflores@xxxxxxx | lflores@xxxxxxxxxx <lflores@xxxxxxxxxx>
M: +17087388804

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx