Re: High CPU usage by ceph-mgr in 14.2.5

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Wed, 18 Dec 2019 17:46:30 -0500



    We are also running into this issue on one of our clusters -
    balancer mode upmap, about 950 OSDs.

    
    Andras

    
    On 12/18/19 4:44 PM, Bryan Stillwell wrote:

    
      On Dec 18, 2019, at 11:58 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:

      
          On Wed, 18 Dec 2019, Bryan Stillwell wrote:

            
              After upgrading one of our clusters from Nautilus 14.2.2
              to Nautilus 14.2.5 I'm seeing 100% CPU usage by a single
              ceph-mgr thread (found using 'top -H').  Attaching to the
              thread with strace shows a lot of mmap and munmap calls.
              Here's the distribution after watching it for a few
              minutes:

              
              48.73% - mmap
              49.48% - munmap
              1.75% - futex
              0.05% - madvise
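For anyone who wants to produce a comparable breakdown on their own cluster, here is a minimal sketch that tallies syscall names from captured strace output. The message does not say how the percentages were computed, so this assumes they are shares of the call count; the function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Syscall name at the start of a strace line, with the optional "[pid N] "
# prefix that strace prints when it follows more than one thread.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def syscall_distribution(lines):
    """Return [(syscall, percent_of_calls)], most frequent first."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # ignores "<... resumed>" continuations, signals, blanks
            counts[match.group(1)] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(name, 100.0 * n / total) for name, n in counts.most_common()]


# Illustrative input only; real lines would come from the strace capture.
sample = [
    "mmap(NULL, 4096, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0000000000",
    "munmap(0x7f0000000000, 4096) = 0",
    "[pid 123] mmap(NULL, 8192, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0000002000",
    "<... futex resumed>) = 0",
    "futex(0x55d0c0, FUTEX_WAIT_PRIVATE, 0, NULL) = 0",
]
for name, pct in syscall_distribution(sample):
    print(f"{pct:6.2f}% - {name}")
```

The input could be captured with something like `strace -p <tid> -o trace.txt`, where `<tid>` is the busy thread ID shown by `top -H`, and then fed to the function line by line.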

              
              I've upgraded 3 other clusters so far (120 OSDs, 30 OSDs,
              200 OSDs), but this is the only one which has seen the
              problem (355 OSDs). Perhaps it has something to do with
              its size?

              
              I was suspecting it might have to do with one of the
              modules misbehaving, so I disabled all of them:

              
              # ceph mgr module ls | jq -r '.enabled_modules'
              []

              
              But that didn't help (I restarted the mgrs after disabling
              the modules too).
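One caveat worth checking when repeating this test: an empty `enabled_modules` list may not mean that no module code is loaded. As far as I know, Nautilus mgr builds also report an `always_on_modules` list in the same JSON, and those modules are not affected by `ceph mgr module disable`. The sketch below assumes that key name and simply returns both lists; the helper name and the sample document are hypothetical.

```python
import json


def summarize_mgr_modules(module_ls_json):
    """Split `ceph mgr module ls` JSON into enabled and always-on modules.

    Assumes the keys "enabled_modules" and "always_on_modules"; a missing
    key is treated as an empty list rather than an error.
    """
    data = json.loads(module_ls_json)
    return {
        "enabled": sorted(data.get("enabled_modules", [])),
        "always_on": sorted(data.get("always_on_modules", [])),
    }


# Illustrative document only, not output captured from the affected cluster.
sample = '{"always_on_modules": ["status", "balancer"], "enabled_modules": []}'
summary = summarize_mgr_modules(sample)
print("enabled:  ", summary["enabled"])
print("always on:", summary["always_on"])
```

If the always-on list is non-empty on the affected cluster, disabling the optional modules alone would not rule out a module as the source of the load.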

              
              I also tried setting debug_mgr and debug_mgrc to 20, but
              nothing popped out at me as being the cause of the
              problem.

              
              It only seems to affect the active mgr.  If I stop the
              active mgr the problem moves to one of the other mgrs.

              
              Any guesses or tips on what next steps I should take to
              figure out what's going on?

            
            What are the balancer modes on the affected and unaffected cluster(s)?

          
      The affected cluster has a balancer mode of "none".

      The other three are "upmap", "none", and "upmap".
      

      I don't know if you saw in ceph-users, but this bug report seems to
      point at the finisher-Mgr thread:
      

      https://tracker.ceph.com/issues/43364
      

      Thanks,
      Bryan
      


    
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx