Re: High CPU usage by ceph-mgr on idle Ceph cluster

Donny Davis <donny@xxxxxxxxxxxxxx> · Mon, 13 Feb 2017 08:18:33 -0500

I am having the same issue. When I looked at my idle cluster this morning, one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of that.  I have 3 AIO nodes, and only one of them seemed to be affected. 

On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
Want to install debuginfo packages and use something like this to try

and find out where it is spending most of its time?

https://poormansprofiler.org/

Note that you may need to do multiple runs to get a "feel" for where

it is spending most of its time. Also not that likely only one or two

threads will be using the CPU (you can see this in ps output using a

command like the following) the rest will likely be idle or waiting

for something.

# ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan

Observation of these two and maybe a couple of manual gstack dumps

like this to compare thread ids to ps output (LWP is the thread id

(tid) in gdb output) should give us some idea of where it is spinning.

# gstack $(pidof ceph-mgr)

On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff

<robert.longstaff@xxxxxxxxx> wrote:

> FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 w/

> elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has

> allocated ~11GB of RAM after a single day of usage. Only the active manager

> is performing this way. The growth is linear and reproducible.

>

> The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB OSDs

> each.

>

>

> top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21

>

> Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie

>

> %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0

> st

>

> KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache

>

> KiB Swap:  2097148 total,  2097148 free,        0 used.  4836772 avail Mem

>

>

>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND

>

>  2351 ceph      20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27 ceph-mgr

>

>  2302 ceph      20   0  620316 267992 157620 S   2.3  1.6  65:11.50 ceph-mon

>

>

> On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J

> <Bryan.Stillwell@xxxxxxxxxxx> wrote:

>>

>> John,

>>

>> This morning I compared the logs from yesterday and I show a noticeable

>> increase in messages like these:

>>

>> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575

>> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441

>> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:

>> notify_all mon_status

>> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:

>> notify_all health

>> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:

>> notify_all pg_summary

>> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active

>> mgrdigest v1

>> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1

>> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575

>> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441

>> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:

>> notify_all mon_status

>> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:

>> notify_all health

>> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:

>> notify_all pg_summary

>> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active

>> mgrdigest v1

>> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1

>>

>>

>> In a 1 minute period yesterday I saw 84 times this group of messages

>> showed up.  Today that same group of messages showed up 156 times.

>>

>> Other than that I did see an increase in this messages from 9 times a

>> minute to 14 times a minute:

>>

>> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> -

>> conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0

>> l=0).fault with nothing to send and in the half  accept state just closed

>>

>> Let me know if you need anything else.

>>

>> Bryan

>>

>>

>> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"

>> <ceph-users-bounces@lists.ceph.com on behalf of

>> Bryan.Stillwell@xxxxxxxxxxx> wrote:

>>

>> >On 1/10/17, 5:35 AM, "John Spray" <jspray@xxxxxxxxxx> wrote:

>> >

>> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J

>> >><Bryan.Stillwell@xxxxxxxxxxx> wrote:

>> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a

>> >>> single node, two OSD cluster, and after a while I noticed that the new

>> >>> ceph-mgr daemon is frequently using a lot of the CPU:

>> >>>

>> >>> 17519 ceph      20   0  850044 168104    208 S 102.7  4.3   1278:27

>> >>> ceph-mgr

>> >>>

>> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU

>> >>> usage down to < 1%, but after a while it climbs back up to > 100%.

>> >>> Has

>> >>> anyone else seen this?

>> >>

>> >>Definitely worth investigating, could you set "debug mgr = 20" on the

>> >>daemon to see if it's obviously spinning in a particular place?

>> >

>> >I've injected that option to the ceps-mgr process, and now I'm just

>> >waiting for it to go out of control again.

>> >

>> >However, I've noticed quite a few messages like this in the logs already:

>> >

>> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>

>> >172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2

>> >cs=1 l=0).fault initiating reconnect

>> >2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>

>> >172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800

>> >s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0

>> > l=0).handle_connect_msg

>> >accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING

>> >2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>

>> >172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800

>> >s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0

>> > l=0).handle_connect_msg

>> >accept peer reset, then tried to connect to us, replacing

>> >2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>

>> >172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800

>> >s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to

>> >send and in the half  accept state just closed

>> >

>> >

>> >What's weird about that is that this is a single node cluster with

>> >ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same

>> >host.  So none of the communication should be leaving the node.

>> >

>> >Bryan

>>

>> E-MAIL CONFIDENTIALITY NOTICE:

>> The contents of this e-mail message and any attachments are intended

>> solely for the addressee(s) and may contain confidential and/or legally

>> privileged information. If you are not the intended recipient of this

>> message or if this message has been addressed to you in error, please

>> immediately alert the sender by reply e-mail and then delete this message

>> and any attachments. If you are not the intended recipient, you are notified

>> that any use, dissemination, distribution, copying, or storage of this

>> message or any attachment is strictly prohibited.

>>

>> _______________________________________________

>> ceph-users mailing list

>> ceph-users@xxxxxxxxxxxxxx

>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>

>

>

>

> --

> - Rob

>

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>

--

Cheers,

Brad

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com