Squid Manager Daemon: balancer crashing orchestrator and dashboard

Hello community,
We are facing an issue with the Ceph manager daemon after migrating from Reef 18.4.2 to Squid 19.2.0 and were wondering whether anyone has already seen this or could point us to where to look further. Turning on the balancer (upmap mode) hangs our mgr completely most of the time and leaves the orchestrator as well as the dashboard/UI unresponsive. We first noticed this during the upgrade itself (the mgr is the first daemon in line) and had to continue with the balancer turned off; post upgrade we still have to keep it off. We run the daemons in Docker containers with cephadm as the orchestrator, and the health status is always HEALTH_OK.
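For reference, this is roughly the sequence we use to turn the balancer on and check it, nothing unusual as far as we can tell:

  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status
  # for now we have to keep it off after failing the mgr over
  ceph balancer off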

A few observations:
- ceph balancer on -> mgr logs show some pg-upmap-items commands being executed
- debug logs show that the balancer is done -> mgr stops working
- only pgmap debug logs keep appearing in the container
- after some time (10-20 minutes) the mgr fails over to the standby node
- the mgr starts and begins a full cluster inventory (services, disks, daemons, networks, etc.)
- the dashboard comes up and orch commands start working again
- the balancer kicks in, performs some upmaps, and the cycle repeats, failing over to the standby node after quite some time
- while the mgr is hung, its TCP port still shows as listening (netstat) but does not even respond to telnet; when the mgr is working we can query it with curl (see the commands after this list)
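A forced failover gets the mgr back faster than waiting it out, and something along these lines should show where the process is stuck (the py-spy part is only an idea we are experimenting with; it is not in the container image by default and has to be installed there first):

  # force the hung mgr to fail over to the standby immediately
  ceph mgr fail ceph-node001.hgythj

  # inspect the stuck mgr's Python threads from inside its container
  docker exec -it <mgr-container> bash
  pip install py-spy
  py-spy dump --pid "$(pidof ceph-mgr)"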

Cluster overview:
  cluster:
    id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
    health: HEALTH_OK

  services:
    mon:        5 daemons, quorum ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 3d)
    mgr:        ceph-node001.hgythj(active, since 24h), standbys: ceph-node002.jphtvg
    mds:        21/21 daemons up, 12 standby
    osd:        384 osds: 384 up (since 2d), 384 in (since 7d)
    rbd-mirror: 2 daemons active (2 hosts)
    rgw:        64 daemons active (32 hosts, 1 zones)

  data:
    volumes: 2/2 healthy
    pools:   16 pools, 12793 pgs
    objects: 751.20M objects, 1.4 PiB
    usage:   4.5 PiB used, 1.1 PiB / 5.6 PiB avail
    pgs:     12410 active+clean
             285   active+clean+scrubbing
             98    active+clean+scrubbing+deep

  io:
    client:   5.8 GiB/s rd, 169 MiB/s wr, 43.77k op/s rd, 9.92k op/s wr


We can provide a more detailed log sequence, but for now these are the entries we see:
debug 2024-10-18T13:23:34.478+0000 7fad971ab640  0 [balancer DEBUG root] Waking up [active, now 2024-10-18_13:23:34]
debug 2024-10-18T13:23:34.478+0000 7fad971ab640  0 [balancer DEBUG root] Running
debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root] Optimize plan auto_2024-10-18_13:23:34
debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root] Mode upmap, max misplaced 0.050000
debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer DEBUG root] unknown 0.000000 degraded 0.000000 inactive 0.000000 misplaced 0
debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root] do_upmap
debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root] pools ['cephfs.our-awesome-pool1.meta', 'our-awesome-k8s.ceph-csi.rbd', 'cephfs.our-awesome-poo2l.data', 'cephfs.our-awesome-pool2.meta', 'europe-1.rgw.buckets.non-ec', 'europe-1.rgw.buckets.data', 'our-awesome-k8s2.ceph-csi.rbd', 'europe-1.rgw.log', '.mgr', 'our-awesome-k8s4.ceph-csi.rbd', 'europe-1.rgw.control', 'our-awesome-k8s3.ceph-csi.rbd', 'europe-1.rgw.meta', 'europe-1.rgw.buckets.index', '.rgw.root', 'cephfs.our-awesome-pool1.data']
debug 2024-10-18T13:23:34.814+0000 7fad971ab640  0 [balancer INFO root] prepared 10/10 upmap changes
debug 2024-10-18T13:23:34.814+0000 7fad971ab640  0 [balancer INFO root] Executing plan auto_2024-10-18_13:23:34
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.148 mappings [{'from': 113, 'to': 94}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.302 mappings [{'from': 138, 'to': 128}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.34e mappings [{'from': 92, 'to': 89}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.504 mappings [{'from': 156, 'to': 94}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.836 mappings [{'from': 148, 'to': 54}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.e4c mappings [{'from': 157, 'to': 54}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.e56 mappings [{'from': 147, 'to': 186}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.e63 mappings [{'from': 79, 'to': 31}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.ed9 mappings [{'from': 158, 'to': 237}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root] ceph osd pg-upmap-items 44.f1e mappings [{'from': 153, 'to': 237}]
debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer DEBUG root] commands [<mgr_module.CommandResult object at 0x7fae2e05b700>, <mgr_module.CommandResult object at 0x7fad2d889ca0>, <mgr_module.CommandResult object at 0x7fad2d810340>, <mgr_module.CommandResult object at 0x7fad2d88c5e0>, <mgr_module.CommandResult object at 0x7fad2d418130>, <mgr_module.CommandResult object at 0x7fad2d418ac0>, <mgr_module.CommandResult object at 0x7fae2f8d7cd0>, <mgr_module.CommandResult object at 0x7fad2d448520>, <mgr_module.CommandResult object at 0x7fad2d4480d0>, <mgr_module.CommandResult object at 0x7fad2d448fa0>]
162.55.93.25 - - [18/Oct/2024:13:23:35] "GET /metrics HTTP/1.1" 200 7731011 "" "Prometheus/2.43.0"
...
debug 2024-10-18T13:23:36.110+0000 7fad971ab640  0 [balancer DEBUG root] done
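We have not yet confirmed whether the prepared upmaps actually make it into the osdmap before the hang; checking that and raising the mgr log level while reproducing is next on our list, along these lines:

  # do the pg-upmap-items from the plan show up in the osdmap?
  ceph osd dump | grep pg_upmap_items | head

  # more verbose mgr logging while reproducing, then back to the default
  ceph config set mgr debug_mgr 10
  ceph config rm mgr debug_mgr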


We suspected this might be caused by older Ceph clients connecting and being (wrongly) identified by Ceph as Luminous, in combination with the new upmap-read balancer mode (I think it came with Squid, but I might be wrong) doing something in the background even when not selected. However, setting set-require-min-compat-client did not help in our case, so we dismissed this assumption.
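For completeness, this is the kind of thing we ran while testing that theory (the release shown is only an example of what one might set; we are not certain which value is strictly required for upmap-read):

  # which releases/features do the connected clients actually report?
  ceph features

  # what we experimented with (example release; refuses if older clients are connected)
  ceph osd set-require-min-compat-client reef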


We would be beyond happy for any advice on where to look further, as running without the balancer is sad.
If anyone would like to go through more detailed logs, we are glad to provide them (the commands below are how we have been pulling them).
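In case someone wants to dig in, this is how we collect the mgr logs (daemon name and fsid as in the status output above):

  cephadm logs --name mgr.ceph-node001.hgythj
  # or via the systemd unit on the host
  journalctl -u ceph-96df99f6-fc1a-11ea-90a4-6cb3113cb732@mgr.ceph-node001.hgythj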

Best,
Laimis J.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


