Re: High CPU usage by ceph-mgr on idle Ceph cluster

Muthusamy Muthiah <muthiah.muthusamy@xxxxxxxxx> · Fri, 17 Feb 2017 11:57:35 +0530

On one of our platforms, ceph-mgr uses 3 CPU cores. Is there a ticket available for this issue?
Thanks,
Muthu

On 14 February 2017 at 03:13, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
Could one of the reporters open a tracker for this issue and attach
the requested debugging data?

On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis <donny@xxxxxxxxxxxxxx> wrote:

> I am having the same issue. When I looked at my idle cluster this morning,

> one of the nodes had 400% cpu utilization, and ceph-mgr was 300% of that.  I

> have 3 AIO nodes, and only one of them seemed to be affected.

>

> On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

>>

>> Want to install debuginfo packages and use something like this to try

>> and find out where it is spending most of its time?

>>

>> https://poormansprofiler.org/

>>

>> Note that you may need to do multiple runs to get a "feel" for where

>> it is spending most of its time. Also note that likely only one or two

>> threads will be using the CPU (you can see this in ps output using a

>> command like the following) the rest will likely be idle or waiting

>> for something.

>>

>> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan

>>

>> Observation of these two and maybe a couple of manual gstack dumps

>> like this to compare thread ids to ps output (LWP is the thread id

>> (tid) in gdb output) should give us some idea of where it is spinning.

>>

>> # gstack $(pidof ceph-mgr)
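
A minimal sketch of the comparison described above, assuming the column order of the quoted `ps` command (%cpu first, pid third, tid fourth) and the usual gstack thread header of the form `Thread N (Thread 0x... (LWP <tid>)):`. The helper names `hot_tids` and `stack_for_lwp` and the default 50% threshold are illustrative choices, not part of any Ceph or gdb tooling.

```shell
# hot_tids PID [MIN_CPU]
# Read `ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan` output on stdin and
# print the thread ids (4th column) of threads belonging to PID whose %cpu is
# at or above MIN_CPU (default 50, an arbitrary threshold for this sketch).
hot_tids() {
    awk -v pid="$1" -v min="${2:-50}" \
        '$3 == pid && $1 + 0 >= min + 0 { print $4 }'
}

# stack_for_lwp TID
# Read gstack output on stdin and print only the backtrace of the thread
# whose header line contains "(LWP TID)".
stack_for_lwp() {
    awk -v needle="(LWP $1)" \
        '/^Thread / { show = (index($0, needle) > 0) } show'
}
```

One possible use: run `ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan | hot_tids "$(pidof ceph-mgr)"`, then for each id printed run `gstack "$(pidof ceph-mgr)" | stack_for_lwp <tid>`. Repeating this a few times shows whether the same frames keep recurring in the busy thread.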

>>

>>

>> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff

>> <robert.longstaff@xxxxxxxxx> wrote:

>> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on CentOS 7 w/
>> > elrepo kernel 4.8.10. ceph-mgr is currently tearing through CPU and has
>> > allocated ~11GB of RAM after a single day of usage. Only the active manager
>> > is performing this way. The growth is linear and reproducible.
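
To put a number on growth like this, one option is to sample the daemon's resident set size periodically and compute the slope. The sketch below assumes a Linux /proc filesystem. `rss_kib` and `growth_kib_per_hour` are hypothetical helper names, and the two-column sample format (epoch seconds, RSS in KiB) is an assumption of this example rather than anything Ceph produces.

```shell
# rss_kib PID
# Print the resident set size in KiB of process PID, taken from the VmRSS
# line of /proc/PID/status (Linux only).
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# growth_kib_per_hour
# Read "epoch_seconds rss_kib" samples on stdin, one per line, and print the
# average growth in KiB per hour between the first and the last sample.
# Prints nothing if there are fewer than two samples.
growth_kib_per_hour() {
    awk 'NR == 1 { t0 = $1; r0 = $2 }
         { t1 = $1; r1 = $2 }
         END { if (NR > 1 && t1 > t0)
                   printf "%.0f\n", (r1 - r0) * 3600 / (t1 - t0) }'
}
```

Samples could be collected with a loop such as `while sleep 60; do echo "$(date +%s) $(rss_kib "$(pidof ceph-mgr)")"; done >> mgr-rss.log` and summarised with `growth_kib_per_hour < mgr-rss.log`. A roughly constant result across different time windows would be consistent with linear growth.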

>> >

>> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with 45x8TB OSDs
>> > each.

>> >

>> >

>> > top - 23:45:47 up 1 day,  1:32,  1 user,  load average: 3.56, 3.94, 4.21
>> > Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
>> > %Cpu(s): 33.9 us, 28.1 sy,  0.0 ni, 37.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
>> > KiB Mem : 16423844 total,  3980500 free, 11556532 used,   886812 buff/cache
>> > KiB Swap:  2097148 total,  2097148 free,        0 used.  4836772 avail Mem
>> >
>> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> >  2351 ceph      20   0 12.160g 0.010t  17380 S 203.7 64.8   2094:27 ceph-mgr
>> >  2302 ceph      20   0  620316 267992 157620 S   2.3  1.6  65:11.50 ceph-mon

>> >

>> >

>> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J

>> > <Bryan.Stillwell@xxxxxxxxxxx> wrote:

>> >>

>> >> John,

>> >>

>> >> This morning I compared the logs from yesterday and I show a noticeable

>> >> increase in messages like these:

>> >>

>> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
>> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
>> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
>> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
>> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
>> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
>> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
>> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
>> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1

>> >>

>> >>

>> >> In a 1 minute period yesterday this group of messages showed up 84
>> >> times.  Today that same group of messages showed up 156 times.
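
Per-minute counts like these could be reproduced with a small filter over the mgr log. This is a sketch under two assumptions: each log entry sits on a single line in the actual log file (the wrapping above comes from the mail client), and each entry starts with a date and a time field as in the excerpt. `count_per_minute` is a hypothetical helper name.

```shell
# count_per_minute TEXT
# Read a log on stdin and, for every line containing the fixed string TEXT,
# count occurrences per minute. The minute key is the date (field 1) plus the
# first five characters of the time (field 2), i.e. "YYYY-MM-DD HH:MM".
# Output: one "YYYY-MM-DD HH:MM COUNT" line per minute, sorted by time.
count_per_minute() {
    awk -v pat="$1" \
        'index($0, pat) > 0 { n[$1 " " substr($2, 1, 5)]++ }
         END { for (m in n) print m, n[m] }' | sort
}
```

Since each group in the excerpt contains exactly one `ms_dispatch active mgrdigest` line, running `count_per_minute 'ms_dispatch active mgrdigest' < <logfile>` against the mgr log (path depends on the installation) would count groups rather than individual lines.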

>> >>

>> >> Other than that I did see an increase in these messages from 9 times a

>> >> minute to 14 times a minute:

>> >>

>> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> - conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed

>> >>

>> >> Let me know if you need anything else.

>> >>

>> >> Bryan

>> >>

>> >>

>> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"

>> >> <ceph-users-bounces@lists.ceph.com on behalf of

>> >> Bryan.Stillwell@xxxxxxxxxxx> wrote:

>> >>

>> >> >On 1/10/17, 5:35 AM, "John Spray" <jspray@xxxxxxxxxx> wrote:

>> >> >

>> >> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J

>> >> >><Bryan.Stillwell@xxxxxxxxxxx> wrote:

>> >> >>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>> >> >>> single node, two OSD cluster, and after a while I noticed that the new
>> >> >>> ceph-mgr daemon is frequently using a lot of the CPU:
>> >> >>>
>> >> >>> 17519 ceph      20   0  850044 168104    208 S 102.7  4.3   1278:27 ceph-mgr
>> >> >>>
>> >> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>> >> >>> usage down to < 1%, but after a while it climbs back up to > 100%. Has
>> >> >>> anyone else seen this?

>> >> >>

>> >> >>Definitely worth investigating, could you set "debug mgr = 20" on the

>> >> >>daemon to see if it's obviously spinning in a particular place?

>> >> >

>> >> >I've injected that option to the ceph-mgr process, and now I'm just

>> >> >waiting for it to go out of control again.

>> >> >

>> >> >However, I've noticed quite a few messages like this in the logs

>> >> > already:

>> >> >

>> >> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2 cs=1 l=0).fault initiating reconnect
>> >> >2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>> >> >2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
>> >> >2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to send and in the half  accept state just closed

>> >> >

>> >> >

>> >> >What's weird about that is that this is a single node cluster with

>> >> >ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same

>> >> >host.  So none of the communication should be leaving the node.

>> >> >

>> >> >Bryan

>> >>


>> >>

>> >> _______________________________________________

>> >> ceph-users mailing list

>> >> ceph-users@xxxxxxxxxxxxxx

>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>> >

>> >

>> >

>> >

>> > --

>> > - Rob

>> >


>> >

>>

>>

>>

>> --

>> Cheers,

>> Brad


>

>

--

Cheers,

Brad
