Re: ceph-mgr high CPU utilization

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Fri, 1 May 2020 08:48:17 -0400

A follow-up on the ceph-mgr misbehavior: the upmap balancer does not 
seem to be acting reasonably either.  It tries to create upmap entries 
every minute or so and claims to succeed, but they never show up in the 
OSD map.  With logging set to 'debug', I see upmap entries being 
created such as:

2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.60c4 mappings [{'to': 3313L, 'from': 3371L}]
2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.632b mappings [{'to': 2187L, 'from': 1477L}]
2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.6b9c mappings [{'to': 3315L, 'from': 3371L}]
2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.6bf6 mappings [{'to': 1581L, 'from': 1477L}]
2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.7da4 mappings [{'to': 2419L, 'from': 2537L}]
...
2020-05-01 08:43:07.909 7fffca074700 20 mgr[balancer] commands 
[<mgr_module.CommandResult object at 0x7fffcc990550>, 
<mgr_module.CommandResult object at 0x7fffcc990fd0>, 
<mgr_module.CommandResult object at 0x7fffcc9907d0>, 
<mgr_module.CommandResult object at 0x7fffcc990650>, 
<mgr_module.CommandResult object at 0x7fffcc990610>, 
<mgr_module.CommandResult object at 0x7fffcc990f50>, 
<mgr_module.CommandResult object at 0x7fffcc990bd0>, 
<mgr_module.CommandResult object at 0x7fffcc990d90>, 
<mgr_module.CommandResult object at 0x7fffcc990ad0>, 
<mgr_module.CommandResult object at 0x7fffcc990410>, 
<mgr_module.CommandResult object at 0x7fffbed241d0>, 
<mgr_module.CommandResult object at 0x7fff6a6caf90>, 
<mgr_module.CommandResult object at 0x7fffbed242d0>, 
<mgr_module.CommandResult object at 0x7fffbed24d90>, 
<mgr_module.CommandResult object at 0x7fffbed24d50>, 
<mgr_module.CommandResult object at 0x7fffbed24550>, 
<mgr_module.CommandResult object at 0x7fffbed245d0>, 
<mgr_module.CommandResult object at 0x7fffbed24510>, 
<mgr_module.CommandResult object at 0x7fffbed24690>, 
<mgr_module.CommandResult object at 0x7fffbed24990>]
...
2020-05-01 08:43:16.733 7fffca074700 20 mgr[balancer] done
...

but these mappings do not show up in the osd dump.  A minute later, 
the balancer tries again and comes up with a very similar set of 
mappings (same from and to OSDs, slightly different PG numbers), and 
it keeps going like that every minute without making any progress (the 
set of upmap entries stays the same and does not grow).
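
One way to make this comparison repeatable is to pull the proposed 
PG IDs out of the balancer debug log and check them against the osd 
dump.  The following is only a sketch, not part of Ceph: the function 
names are made up for illustration, the log format is taken from the 
lines quoted above, and it assumes the JSON form of the osd dump 
exposes applied entries under a "pg_upmap_items" key with a "pgid" 
field per entry.

```python
import re

# Matches "pg-upmap-items <pgid> mappings [...]" as printed by the balancer
# at debug level; \s+ tolerates lines that were wrapped by a mail client.
UPMAP_RE = re.compile(r"pg-upmap-items\s+(\S+)\s+mappings\s+\[(.*?)\]", re.S)
ENTRY_RE = re.compile(r"\{([^}]*)\}")


def parse_proposed_upmaps(log_text):
    """Return {pgid: [(from_osd, to_osd), ...]} for every proposal in the log."""
    proposed = {}
    for pgid, body in UPMAP_RE.findall(log_text):
        pairs = []
        for entry in ENTRY_RE.findall(body):
            # Key order inside the dict repr is not guaranteed, so look up
            # 'from' and 'to' separately; the trailing 'L' is ignored.
            src = re.search(r"'from':\s*(\d+)", entry)
            dst = re.search(r"'to':\s*(\d+)", entry)
            if src and dst:
                pairs.append((int(src.group(1)), int(dst.group(1))))
        proposed[pgid] = pairs
    return proposed


def missing_upmaps(proposed, osd_dump):
    """Return the sorted proposed PG IDs that have no entry in the osd dump.

    osd_dump is the parsed JSON osd dump; the "pg_upmap_items" key and the
    "pgid" field are assumptions about that output.
    """
    applied = {item["pgid"] for item in osd_dump.get("pg_upmap_items", [])}
    return sorted(pgid for pgid in proposed if pgid not in applied)
```

To use it, load the mgr log text into a string, parse the output of 
"ceph osd dump --format json" with json.loads, and pass both through 
the two functions.  If every proposed PG comes back as missing on 
each cycle, that would match the behavior described above.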

Andras

On 5/1/20 8:12 AM, Andras Pataki wrote:
I'm wondering if anyone still sees issues with ceph-mgr using a lot of 
CPU and being unresponsive, even in recent Nautilus releases.  We 
recently upgraded our largest cluster, which has about 3500 OSDs, from 
Mimic to Nautilus (14.2.8).  Now ceph-mgr is constantly at 100-200% CPU 
(1-2 cores) and becomes unresponsive after a few minutes.  The 
finisher-Mgr queue length keeps growing (I've seen it at over 100k), 
similar to the symptoms many people reported with earlier Nautilus 
releases.  This is what it looks like after an hour of running:

    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    },
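
For anyone comparing their own numbers, the counters above can be 
reduced to a few figures with a short script.  This is only a sketch: 
the function name is hypothetical, it assumes the fragment comes from 
the mgr's perf counter dump loaded as JSON, and it assumes "sum" and 
"avgtime" are in seconds.

```python
import json


def finisher_summary(perf, name="finisher-Mgr"):
    """Summarize one finisher section of a parsed perf counter dump."""
    section = perf[name]
    latency = section["complete_latency"]
    count = latency["avgcount"]
    # avgtime is sum / avgcount; recompute it so the check does not
    # depend on the precomputed field being present.
    mean = latency["sum"] / count if count else 0.0
    return {
        "queue_len": section["queue_len"],
        "completed": count,
        "mean_latency": mean,
    }


# The fragment quoted above, wrapped in braces so it parses on its own.
sample = json.loads("""
{
    "finisher-Mgr": {
        "queue_len": 66078,
        "complete_latency": {
            "avgcount": 21,
            "sum": 2098.408767721,
            "avgtime": 99.924227034
        }
    }
}
""")

summary = finisher_summary(sample)
print(summary)
```

With the values above, only 21 completions are recorded against 66078 
queued items, and the recomputed mean (2098.408767721 / 21) agrees 
with the reported avgtime of about 99.9.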

We have a pretty vanilla manager config: only the balancer is in use, 
in upmap mode.  Here are the always-on and enabled modules:

    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator_cli",
        "progress",
        "rbd_support",
        "status",
        "volumes"
    ],
    "enabled_modules": [
        "restful"
    ],

Any ideas or outstanding issues in this area?

Andras
