If an upmap is not stored, it means that OSDMap::check_pg_upmaps is deciding
that those upmaps are invalid for some reason. Additional debugging can help
sort out why. (Maybe you have a complex crush tree and the balancer is
creating invalid upmaps).

-- dan

On Fri, May 1, 2020 at 2:48 PM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Also just a follow-up on the misbehavior of ceph-mgr. It looks like the
> upmap balancer is not acting reasonably either. It is trying to create
> upmap entries every minute or so - and claims to be successful, but they
> never show up in the OSD map. Setting the logging to 'debug', I see
> upmap entries created such as:
>
> 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.60c4 mappings [{'to': 3313L, 'from': 3371L}]
> 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.632b mappings [{'to': 2187L, 'from': 1477L}]
> 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.6b9c mappings [{'to': 3315L, 'from': 3371L}]
> 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.6bf6 mappings [{'to': 1581L, 'from': 1477L}]
> 2020-05-01 08:43:07.909 7fffca074700  4 mgr[balancer] ceph osd pg-upmap-items 9.7da4 mappings [{'to': 2419L, 'from': 2537L}]
> ...
> 2020-05-01 08:43:07.909 7fffca074700 20 mgr[balancer] commands
> [<mgr_module.CommandResult object at 0x7fffcc990550>,
>  <mgr_module.CommandResult object at 0x7fffcc990fd0>,
>  <mgr_module.CommandResult object at 0x7fffcc9907d0>,
>  <mgr_module.CommandResult object at 0x7fffcc990650>,
>  <mgr_module.CommandResult object at 0x7fffcc990610>,
>  <mgr_module.CommandResult object at 0x7fffcc990f50>,
>  <mgr_module.CommandResult object at 0x7fffcc990bd0>,
>  <mgr_module.CommandResult object at 0x7fffcc990d90>,
>  <mgr_module.CommandResult object at 0x7fffcc990ad0>,
>  <mgr_module.CommandResult object at 0x7fffcc990410>,
>  <mgr_module.CommandResult object at 0x7fffbed241d0>,
>  <mgr_module.CommandResult object at 0x7fff6a6caf90>,
>  <mgr_module.CommandResult object at 0x7fffbed242d0>,
>  <mgr_module.CommandResult object at 0x7fffbed24d90>,
>  <mgr_module.CommandResult object at 0x7fffbed24d50>,
>  <mgr_module.CommandResult object at 0x7fffbed24550>,
>  <mgr_module.CommandResult object at 0x7fffbed245d0>,
>  <mgr_module.CommandResult object at 0x7fffbed24510>,
>  <mgr_module.CommandResult object at 0x7fffbed24690>,
>  <mgr_module.CommandResult object at 0x7fffbed24990>]
> ...
> 2020-05-01 08:43:16.733 7fffca074700 20 mgr[balancer] done
> ...
>
> but these mappings do not show up in the osd dump. And a minute later,
> the balancer tries again and comes up with a set of very similar
> mappings (same from and to OSDs, slightly different PG numbers) - and
> keeps going like that every minute without any progress (the set of
> upmap entries stays the same, does not increase).
>
> Andras
>
>
> On 5/1/20 8:12 AM, Andras Pataki wrote:
> > I'm wondering if anyone still sees issues with ceph-mgr using CPU and
> > being unresponsive even in recent Nautilus releases. We upgraded our
> > largest cluster from Mimic to Nautilus (14.2.8) recently - it has
> > about 3500 OSDs. Now ceph-mgr is constantly at 100-200% CPU (1-2
> > cores), and becomes unresponsive after a few minutes. The
> > finisher-Mgr queue length grows (I've seen it at over 100k) - similar
> > symptoms as seen with earlier Nautilus releases by many.
> > This is what it looks like after an hour of running:
> >
> >     "finisher-Mgr": {
> >         "queue_len": 66078,
> >         "complete_latency": {
> >             "avgcount": 21,
> >             "sum": 2098.408767721,
> >             "avgtime": 99.924227034
> >         }
> >     },
> >
> > We have a pretty vanilla manager config; only the balancer is enabled
> > in upmap mode. Here are the enabled modules:
> >
> >     "always_on_modules": [
> >         "balancer",
> >         "crash",
> >         "devicehealth",
> >         "orchestrator_cli",
> >         "progress",
> >         "rbd_support",
> >         "status",
> >         "volumes"
> >     ],
> >     "enabled_modules": [
> >         "restful"
> >     ],
> >
> > Any ideas or outstanding issues in this area?
> >
> > Andras
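
The counters and module list quoted above can be re-checked at any time from the mgr side. The commands below are a minimal sketch, assuming the active mgr daemon is named mgr.x (a placeholder, not taken from the thread) and that jq is installed:

    # Pull the finisher-Mgr counters from the running mgr's admin socket
    # (run on the host where mgr.x lives; "mgr.x" is a placeholder name):
    ceph daemon mgr.x perf dump | jq '."finisher-Mgr"'

    # What the balancer thinks it is doing, and which modules are loaded
    # (the always_on_modules/enabled_modules output above comes from module ls):
    ceph balancer status
    ceph mgr module ls | jq '{always_on_modules, enabled_modules}'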
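
To follow up on Dan's suggestion at the top of the thread, one of the balancer's proposed mappings can be applied by hand to see whether it survives into the osdmap. The sketch below reuses PG 9.60c4 and OSDs 3371/3313 from the log above; the debug settings are an assumption about where check_pg_upmaps logs its decision (the OSDMap code logs under the osd subsystem), so double-check against your version before relying on them:

    # Propose the same mapping the balancer tried (remap PG 9.60c4 from
    # osd.3371 to osd.3313) and check whether it is actually stored:
    ceph osd pg-upmap-items 9.60c4 3371 3313
    ceph osd dump | grep 'pg_upmap_items 9.60c4'

    # If the entry silently disappears, raise mon-side debugging, retry,
    # and look for check_pg_upmaps messages in the active mon's log:
    ceph config set mon debug_mon 10
    ceph config set mon debug_osd 10
    # ... retry the pg-upmap-items command, inspect the mon log ...
    ceph config rm mon debug_mon
    ceph config rm mon debug_osd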