I'm wondering if anyone still sees issues with ceph-mgr consuming a lot
of CPU and becoming unresponsive, even in recent Nautilus releases. We
recently upgraded our largest cluster, which has about 3500 OSDs, from
Mimic to Nautilus (14.2.8). Since then ceph-mgr sits constantly at
100-200% CPU (1-2 cores) and becomes unresponsive after a few minutes.
The finisher-Mgr queue length keeps growing (I've seen it at over 100k),
which matches the symptoms many people reported with earlier Nautilus
releases. This is what it looks like after an hour of running:

    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    },
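For anyone who wants to compare numbers, here is a minimal Python sketch of how these counters can be read. It assumes the fragment above is wrapped in braces to make it valid JSON, and that "avgtime" is simply "sum" divided by "avgcount" (which the values above are consistent with). The helper name finisher_summary is made up for illustration, and I'm assuming the latency is reported in seconds.

```python
import json

# Sample copied from the output above, wrapped in braces so it parses as JSON.
SAMPLE = """
{
  "finisher-Mgr": {
    "queue_len": 66078,
    "complete_latency": {
      "avgcount": 21,
      "sum": 2098.408767721,
      "avgtime": 99.924227034
    }
  }
}
"""


def finisher_summary(perf, name="finisher-Mgr"):
    """Return (queue_len, completed, mean_latency) for one finisher section.

    Hypothetical helper: `perf` is the parsed perf-counter dictionary.
    """
    section = perf[name]
    latency = section["complete_latency"]
    completed = latency["avgcount"]
    # Recompute the mean from sum / avgcount, guarding against a zero count.
    mean = latency["sum"] / completed if completed else 0.0
    return section["queue_len"], completed, mean


queue_len, completed, mean = finisher_summary(json.loads(SAMPLE))
# Assumption: the latency unit is seconds.
print(f"queue_len={queue_len} completed={completed} mean_latency={mean:.1f}s")
```

Read this way, only 21 callbacks completed during the hour, at roughly 100 (presumably seconds) each on average, while about 66k were still queued.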
We have a pretty vanilla manager config: only the balancer is enabled,
in upmap mode. Here are the enabled modules:

    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator_cli",
        "progress",
        "rbd_support",
        "status",
        "volumes"
    ],
    "enabled_modules": [
        "restful"
    ],
Any ideas or outstanding issues in this area?
Andras