Re: Squid Manager Daemon: balancer crashing orchestrator and dashboard

Hi Laimis,

Thanks for reporting. Can you please raise a tracker ticket and attach the
mgr and mon logs? Can you also bump up the logging level of the balancer
module with `ceph config set mgr mgr/balancer/log_level debug` and of the
mons with `ceph config set mon debug_mon 20`?
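
For reference, a rough sequence for gathering those logs. This is a sketch,
not a prescription: the daemon names are taken from the status output quoted
below, and the exact names on your cluster may differ (`ceph orch ps` lists
them).

```shell
# raise balancer-module and mon verbosity
ceph config set mgr mgr/balancer/log_level debug
ceph config set mon debug_mon 20

# reproduce the hang
ceph balancer on

# pull logs for the active mgr and one mon via cephadm (run on the host)
cephadm logs --name mgr.ceph-node001.hgythj > mgr.log
cephadm logs --name mon.ceph-node001 > mon.log

# restore defaults afterwards
ceph config rm mgr mgr/balancer/log_level
ceph config rm mon debug_mon
```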

Thanks,
Laura

On Fri, Oct 18, 2024 at 10:53 AM <laimis.juzeliunas@xxxxxxxxxx> wrote:

> Hello community,
> We are facing an issue after migrating from Reef 18.4.2 to Squid 19.2.0
> with the Ceph manager daemon, and were wondering if anyone has already faced
> this or could point us to where to look further. Turning on the balancer
> (upmap mode) hangs our mgr completely most of the time and leaves the
> orchestrator as well as the dashboard/UI unresponsive. We noticed this
> initially during the upgrade (as the mgr is first in line) and had to
> continue with the balancer turned off; post-upgrade we still need to keep
> it off. We run Docker daemons with cephadm as the orchestrator, and the
> health status is always HEALTH_OK.
>
> A few observations:
> - `ceph balancer on` -> mgr logs show some pg_upmaps being performed
> - debug logs show that the balancer is done -> mgr stops working
> - only pgmap debug logs remain in the container
> - after some time (10-20 minutes) the mgr fails over to the standby node
> - the mgr starts and begins a full cluster inventory (services, disks,
> daemons, networks, etc.)
> - the dashboard starts up and orch commands begin working
> - the balancer kicks in, performs some upmaps, and the cycle repeats,
> failing over to the standby node after quite some time
> - while the mgr is not working, its TCP port shows as listening (netstat)
> but does not even respond to telnet; when the mgr is working, we can
> query it with curl
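
That last observation (socket listening but never answering) can be probed
directly. A minimal sketch; the hostname and the 8443 port are assumptions
(8443 is the default dashboard SSL port, and `ceph mgr services` shows the
real endpoints when the mgr is responsive):

```shell
# the kernel keeps the listening socket open even if the mgr's Python
# threads are wedged, so "listening" alone proves little
ss -tlnp | grep ceph-mgr

# a hung mgr accepts the TCP connect but never sends a byte, so bound the
# wait; a healthy mgr returns within the timeout
curl -k --max-time 5 https://ceph-node001:8443/
```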
>
> Cluster overview:
>   cluster:
>     id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
>     health: HEALTH_OK
>
>   services:
>     mon:        5 daemons, quorum
> ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 3d)
>     mgr:        ceph-node001.hgythj(active, since 24h), standbys:
> ceph-node002.jphtvg
>     mds:        21/21 daemons up, 12 standby
>     osd:        384 osds: 384 up (since 2d), 384 in (since 7d)
>     rbd-mirror: 2 daemons active (2 hosts)
>     rgw:        64 daemons active (32 hosts, 1 zones)
>
>   data:
>     volumes: 2/2 healthy
>     pools:   16 pools, 12793 pgs
>     objects: 751.20M objects, 1.4 PiB
>     usage:   4.5 PiB used, 1.1 PiB / 5.6 PiB avail
>     pgs:     12410 active+clean
>              285   active+clean+scrubbing
>              98    active+clean+scrubbing+deep
>
>   io:
>     client:   5.8 GiB/s rd, 169 MiB/s wr, 43.77k op/s rd, 9.92k op/s wr
>
>
> We will be able to provide a more detailed log sequence but for now we see
> these entries:
> debug 2024-10-18T13:23:34.478+0000 7fad971ab640  0 [balancer DEBUG root]
> Waking up [active, now 2024-10-18_13:23:34]
> debug 2024-10-18T13:23:34.478+0000 7fad971ab640  0 [balancer DEBUG root]
> Running
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root]
> Optimize plan auto_2024-10-18_13:23:34
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root]
> Mode upmap, max misplaced 0.050000
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer DEBUG root]
> unknown 0.000000 degraded 0.000000 inactive 0.000000 misplaced 0
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root]
> do_upmap
> debug 2024-10-18T13:23:34.730+0000 7fad971ab640  0 [balancer INFO root]
> pools ['cephfs.our-awesome-pool1.meta', 'our-awesome-k8s.ceph-csi.rbd',
> 'cephfs.our-awesome-poo2l.data', 'cephfs.our-awesome-pool2.meta',
> 'europe-1.rgw.buckets.non-ec', 'europe-1.rgw.buckets.data',
> 'our-awesome-k8s2.ceph-csi.rbd', 'europe-1.rgw.log', '.mgr',
> 'our-awesome-k8s4.ceph-csi.rbd', 'europe-1.rgw.control',
> 'our-awesome-k8s3.ceph-csi.rbd', 'europe-1.rgw.meta',
> 'europe-1.rgw.buckets.index', '.rgw.root', 'cephfs.our-awesome-pool1.data']
> debug 2024-10-18T13:23:34.814+0000 7fad971ab640  0 [balancer INFO root]
> prepared 10/10 upmap changes
> debug 2024-10-18T13:23:34.814+0000 7fad971ab640  0 [balancer INFO root]
> Executing plan auto_2024-10-18_13:23:34
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.148 mappings [{'from': 113, 'to': 94}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.302 mappings [{'from': 138, 'to': 128}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.34e mappings [{'from': 92, 'to': 89}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.504 mappings [{'from': 156, 'to': 94}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.836 mappings [{'from': 148, 'to': 54}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.e4c mappings [{'from': 157, 'to': 54}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.e56 mappings [{'from': 147, 'to': 186}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.e63 mappings [{'from': 79, 'to': 31}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.ed9 mappings [{'from': 158, 'to': 237}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer INFO root]
> ceph osd pg-upmap-items 44.f1e mappings [{'from': 153, 'to': 237}]
> debug 2024-10-18T13:23:34.818+0000 7fad971ab640  0 [balancer DEBUG root]
> commands [<mgr_module.CommandResult object at 0x7fae2e05b700>,
> <mgr_module.CommandResult object at 0x7fad2d889ca0>,
> <mgr_module.CommandResult object at 0x7fad2d810340>,
> <mgr_module.CommandResult object at 0x7fad2d88c5e0>,
> <mgr_module.CommandResult object at 0x7fad2d418130>,
> <mgr_module.CommandResult object at 0x7fad2d418ac0>,
> <mgr_module.CommandResult object at 0x7fae2f8d7cd0>,
> <mgr_module.CommandResult object at 0x7fad2d448520>,
> <mgr_module.CommandResult object at 0x7fad2d4480d0>,
> <mgr_module.CommandResult object at 0x7fad2d448fa0>]
> 162.55.93.25 - - [18/Oct/2024:13:23:35] "GET /metrics HTTP/1.1" 200
> 7731011 "" "Prometheus/2.43.0"
> ...
> debug 2024-10-18T13:23:36.110+0000 7fad971ab640  0 [balancer DEBUG root]
> done
>
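When the mgr is in that wedged state, a stack dump of its threads usually
shows which module or mon call everything is stuck behind. One possible
approach, assuming gdb is available (or installable) inside the mgr container;
the daemon name is taken from the status output above:

```shell
# enter the active mgr's container on its host
cephadm enter --name mgr.ceph-node001.hgythj

# inside the container: dump all native thread backtraces of ceph-mgr
# (-o picks the oldest matching PID)
gdb -p "$(pgrep -o ceph-mgr)" -batch -ex 'thread apply all bt'
```

Attaching gdb pauses the process briefly, but since the mgr is already hung
that should be acceptable; the resulting backtraces would be useful on the
tracker ticket.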
>
> We suspected that this might be caused by older Ceph clients connecting
> and being (wrongly) identified as Luminous by Ceph, with the fresh
> upmap-read balancer mode (I think it came with Squid, but I might be wrong)
> having to do something in the background even when disabled. However,
> setting `ceph osd set-require-min-compat-client` did not help our case, so
> we dismissed this assumption.
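
The individual pieces of that theory can still be checked one by one, for
example:

```shell
# what release features does each connected client actually negotiate?
ceph features

# current client floor and balancer state
ceph osd get-require-min-compat-client
ceph balancer status

# classic upmap mode only needs luminous-or-newer clients
ceph osd set-require-min-compat-client luminous
```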
>
>
> We would be beyond happy for any advice on where to look further, as
> having no balancer is sad.
> If anyone would like to go through the detailed logs, we are glad to
> provide them.
>
> Best,
> Laimis J.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>

-- 

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage <https://ceph.io>

Chicago, IL

lflores@xxxxxxx | lflores@xxxxxxxxxx <lflores@xxxxxxxxxx>
M: +17087388804