Also, just a follow-up on the misbehavior of ceph-mgr: it looks like
the upmap balancer is not acting reasonably either. It tries to
create upmap entries every minute or so, and claims to be successful,
but they never show up in the OSD map. With the logging set to
'debug', I see upmap entries being created such as:
2020-05-01 08:43:07.909 7fffca074700 4 mgr[balancer] ceph osd pg-upmap-items 9.60c4 mappings [{'to': 3313L, 'from': 3371L}]
2020-05-01 08:43:07.909 7fffca074700 4 mgr[balancer] ceph osd pg-upmap-items 9.632b mappings [{'to': 2187L, 'from': 1477L}]
2020-05-01 08:43:07.909 7fffca074700 4 mgr[balancer] ceph osd pg-upmap-items 9.6b9c mappings [{'to': 3315L, 'from': 3371L}]
2020-05-01 08:43:07.909 7fffca074700 4 mgr[balancer] ceph osd pg-upmap-items 9.6bf6 mappings [{'to': 1581L, 'from': 1477L}]
2020-05-01 08:43:07.909 7fffca074700 4 mgr[balancer] ceph osd pg-upmap-items 9.7da4 mappings [{'to': 2419L, 'from': 2537L}]
...
2020-05-01 08:43:07.909 7fffca074700 20 mgr[balancer] commands [<mgr_module.CommandResult object at 0x7fffcc990550>, <mgr_module.CommandResult object at 0x7fffcc990fd0>, <mgr_module.CommandResult object at 0x7fffcc9907d0>, <mgr_module.CommandResult object at 0x7fffcc990650>, <mgr_module.CommandResult object at 0x7fffcc990610>, <mgr_module.CommandResult object at 0x7fffcc990f50>, <mgr_module.CommandResult object at 0x7fffcc990bd0>, <mgr_module.CommandResult object at 0x7fffcc990d90>, <mgr_module.CommandResult object at 0x7fffcc990ad0>, <mgr_module.CommandResult object at 0x7fffcc990410>, <mgr_module.CommandResult object at 0x7fffbed241d0>, <mgr_module.CommandResult object at 0x7fff6a6caf90>, <mgr_module.CommandResult object at 0x7fffbed242d0>, <mgr_module.CommandResult object at 0x7fffbed24d90>, <mgr_module.CommandResult object at 0x7fffbed24d50>, <mgr_module.CommandResult object at 0x7fffbed24550>, <mgr_module.CommandResult object at 0x7fffbed245d0>, <mgr_module.CommandResult object at 0x7fffbed24510>, <mgr_module.CommandResult object at 0x7fffbed24690>, <mgr_module.CommandResult object at 0x7fffbed24990>]
...
2020-05-01 08:43:16.733 7fffca074700 20 mgr[balancer] done
...
However, these mappings never show up in the osd dump. A minute
later, the balancer tries again and comes up with a very similar set
of mappings (same from and to OSDs, slightly different PG numbers),
and it keeps going like that every minute without making any progress
(the set of upmap entries in the OSD map stays the same; it does not grow).
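For reference, this is roughly how I'm checking whether the proposed
mappings make it into the OSD map (9.60c4 is just the first PG from
the log excerpt above):

# check the OSD map for one of the PGs the balancer claims to have remapped
ceph osd dump | grep 'pg_upmap_items 9.60c4'
# and ask the balancer module what it thinks it is doing
ceph balancer status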
Andras
On 5/1/20 8:12 AM, Andras Pataki wrote:
I'm wondering if anyone still sees issues with ceph-mgr using
excessive CPU and becoming unresponsive, even in recent Nautilus
releases. We recently upgraded our largest cluster from Mimic to
Nautilus (14.2.8) - it has about 3500 OSDs. Now ceph-mgr is
constantly at 100-200% CPU (1-2 cores) and becomes unresponsive after
a few minutes. The finisher-Mgr queue length grows (I've seen it at
over 100k) - similar symptoms to what many have reported with earlier
Nautilus releases. This is what it looks like after an hour of running:
"finisher-Mgr": {
"queue_len": 66078,
"complete_latency": {
"avgcount": 21,
"sum": 2098.408767721,
"avgtime": 99.924227034
}
},
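(These numbers are from the active mgr's admin socket, gathered along
these lines - mgr.<id> is the active manager daemon here, and jq is
only used for readability:)

# dump the mgr perf counters and pull out the finisher-Mgr section
ceph daemon mgr.<id> perf dump | jq '."finisher-Mgr"'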
We have a pretty vanilla manager config; only the balancer is
enabled, running in upmap mode (set up with the usual commands, see
below). Here are the enabled modules:
"always_on_modules": [
"balancer",
"crash",
"devicehealth",
"orchestrator_cli",
"progress",
"rbd_support",
"status",
"volumes"
],
"enabled_modules": [
"restful"
],
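The module lists above are essentially what 'ceph mgr module ls'
reports. For completeness, the balancer was put into upmap mode with
the standard commands, roughly:

# switch the balancer to upmap mode and make sure it is on
ceph balancer mode upmap
ceph balancer on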
Any ideas or outstanding issues in this area?
Andras