Refer to my previous post for data you can gather that will help narrow
this down.

On Mon, Feb 20, 2017 at 6:36 PM, Jay Linux <jaylinuxgeek at gmail.com> wrote:
> Hello John,
>
> Created a tracker for this issue; refer to:
>
> http://tracker.ceph.com/issues/18994
>
> Thanks
>
> On Fri, Feb 17, 2017 at 6:15 PM, John Spray <jspray at redhat.com> wrote:
>>
>> On Fri, Feb 17, 2017 at 6:27 AM, Muthusamy Muthiah
>> <muthiah.muthusamy at gmail.com> wrote:
>> > On one of our platforms mgr uses 3 CPU cores. Is there a ticket
>> > available for this issue?
>>
>> Not that I'm aware of, you could go ahead and open one.
>>
>> Cheers,
>> John
>>
>> > Thanks,
>> > Muthu
>> >
>> > On 14 February 2017 at 03:13, Brad Hubbard <bhubbard at redhat.com> wrote:
>> >>
>> >> Could one of the reporters open a tracker for this issue and attach
>> >> the requested debugging data?
>> >>
>> >> On Mon, Feb 13, 2017 at 11:18 PM, Donny Davis <donny at fortnebula.com> wrote:
>> >> > I am having the same issue. When I looked at my idle cluster this
>> >> > morning, one of the nodes had 400% CPU utilization, and ceph-mgr was
>> >> > 300% of that. I have 3 AIO nodes, and only one of them seemed to be
>> >> > affected.
>> >> >
>> >> > On Sat, Jan 14, 2017 at 12:18 AM, Brad Hubbard <bhubbard at redhat.com> wrote:
>> >> >>
>> >> >> Want to install debuginfo packages and use something like this to
>> >> >> try and find out where it is spending most of its time?
>> >> >>
>> >> >> https://poormansprofiler.org/
>> >> >>
>> >> >> Note that you may need to do multiple runs to get a "feel" for
>> >> >> where it is spending most of its time. Also note that likely only
>> >> >> one or two threads will be using the CPU (you can see this in the
>> >> >> ps output using a command like the following); the rest will likely
>> >> >> be idle or waiting for something.
>> >> >>
>> >> >> # ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan
>> >> >>
>> >> >> Observing these two, plus maybe a couple of manual gstack dumps
>> >> >> like the one below to compare thread IDs to the ps output (LWP in
>> >> >> the gdb output is the thread ID (tid)), should give us some idea of
>> >> >> where it is spinning.
>> >> >>
>> >> >> # gstack $(pidof ceph-mgr)
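
For reference, the poor-man's-profiler approach linked above boils down to
repeatedly dumping all thread backtraces with gdb and counting which call
chains recur. A minimal sketch along the lines of the script on that page,
adapted here for ceph-mgr, is below (assumes gdb and the ceph debuginfo
packages are installed; sample count and sleep interval are arbitrary):

  #!/bin/bash
  # Sample all ceph-mgr thread stacks a few times, then rank the most
  # common call chains; a spinning thread's stack should dominate.
  nsamples=10
  sleeptime=1
  pid=$(pidof ceph-mgr)

  for x in $(seq 1 "$nsamples"); do
      gdb -ex "set pagination 0" -ex "thread apply all bt" -batch -p "$pid"
      sleep "$sleeptime"
  done | \
  awk '
    BEGIN { s = "" }
    /^Thread/ { print s; s = "" }    # new thread backtrace: emit previous chain
    /^\#/ { if (s != "") { s = s "," $4 } else { s = $4 } }  # 4th field is usually the function name
    END { print s }' | \
  sort | uniq -c | sort -rn -k 1,1
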
>> >> >> On Sat, Jan 14, 2017 at 9:54 AM, Robert Longstaff
>> >> >> <robert.longstaff at tapad.com> wrote:
>> >> >> > FYI, I'm seeing this as well on the latest Kraken 11.1.1 RPMs on
>> >> >> > CentOS 7 w/ elrepo kernel 4.8.10. ceph-mgr is currently tearing
>> >> >> > through CPU and has allocated ~11GB of RAM after a single day of
>> >> >> > usage. Only the active manager is performing this way. The growth
>> >> >> > is linear and reproducible.
>> >> >> >
>> >> >> > The cluster is mostly idle; 3 mons (4 CPU, 16GB), 20 heads with
>> >> >> > 45x8TB OSDs each.
>> >> >> >
>> >> >> > top - 23:45:47 up 1 day, 1:32, 1 user, load average: 3.56, 3.94, 4.21
>> >> >> > Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
>> >> >> > %Cpu(s): 33.9 us, 28.1 sy, 0.0 ni, 37.3 id, 0.0 wa, 0.0 hi, 0.7 si, 0.0 st
>> >> >> > KiB Mem : 16423844 total, 3980500 free, 11556532 used, 886812 buff/cache
>> >> >> > KiB Swap: 2097148 total, 2097148 free, 0 used. 4836772 avail Mem
>> >> >> >
>> >> >> >  PID USER  PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
>> >> >> > 2351 ceph  20   0 12.160g 0.010t  17380 S 203.7 64.8  2094:27 ceph-mgr
>> >> >> > 2302 ceph  20   0  620316 267992 157620 S   2.3  1.6 65:11.50 ceph-mon
>> >> >> >
>> >> >> > On Wed, Jan 11, 2017 at 12:00 PM, Stillwell, Bryan J
>> >> >> > <Bryan.Stillwell at charter.com> wrote:
>> >> >> >>
>> >> >> >> John,
>> >> >> >>
>> >> >> >> This morning I compared the logs from yesterday and I see a
>> >> >> >> noticeable increase in messages like these:
>> >> >> >>
>> >> >> >> 2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> >> >> 2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> >> >> 2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
>> >> >> >> 2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
>> >> >> >> 2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
>> >> >> >> 2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
>> >> >> >> 2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >> >> >> 2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
>> >> >> >> 2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
>> >> >> >> 2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all: notify_all mon_status
>> >> >> >> 2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all: notify_all health
>> >> >> >> 2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all: notify_all pg_summary
>> >> >> >> 2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active mgrdigest v1
>> >> >> >> 2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
>> >> >> >>
>> >> >> >> In a 1-minute period yesterday, this group of messages showed up
>> >> >> >> 84 times. Today that same group of messages showed up 156 times.
>> >> >> >>
>> >> >> >> Other than that, I did see an increase in these messages from 9
>> >> >> >> times a minute to 14 times a minute:
>> >> >> >>
>> >> >> >> 2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> - conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>> >> >> >>
>> >> >> >> Let me know if you need anything else.
>> >> >> >>
>> >> >> >> Bryan
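
To reproduce the per-minute counts Bryan describes, a rough one-liner along
these lines should get close; the log path is an assumption based on the
default naming, so point it at your actual ceph-mgr log:

  # Count mgrdigest dispatch lines per minute, bucketed by the HH:MM
  # prefix of the log timestamp.
  grep 'ms_dispatch.*mgrdigest' /var/log/ceph/ceph-mgr.$(hostname -s).log \
      | awk '{ print substr($2, 1, 5) }' | sort | uniq -c
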
>> >> >> >>
>> >> >> >> On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
>> >> >> >> <ceph-users-bounces at lists.ceph.com on behalf of
>> >> >> >> Bryan.Stillwell at charter.com> wrote:
>> >> >> >>
>> >> >> >> >On 1/10/17, 5:35 AM, "John Spray" <jspray at redhat.com> wrote:
>> >> >> >> >
>> >> >> >> >>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> >> >> >> >><Bryan.Stillwell at charter.com> wrote:
>> >> >> >> >>> Last week I decided to play around with Kraken (11.1.1-1xenial)
>> >> >> >> >>> on a single-node, two-OSD cluster, and after a while I noticed
>> >> >> >> >>> that the new ceph-mgr daemon is frequently using a lot of the CPU:
>> >> >> >> >>>
>> >> >> >> >>> 17519 ceph  20   0  850044 168104    208 S 102.7  4.3 1278:27 ceph-mgr
>> >> >> >> >>>
>> >> >> >> >>> Restarting it with 'systemctl restart ceph-mgr*' seems to get
>> >> >> >> >>> its CPU usage down to < 1%, but after a while it climbs back up
>> >> >> >> >>> to > 100%. Has anyone else seen this?
>> >> >> >> >>
>> >> >> >> >>Definitely worth investigating, could you set "debug mgr = 20" on
>> >> >> >> >>the daemon to see if it's obviously spinning in a particular place?
>> >> >> >> >
>> >> >> >> >I've injected that option into the ceph-mgr process, and now I'm
>> >> >> >> >just waiting for it to go out of control again.
>> >> >> >> >
>> >> >> >> >However, I've noticed quite a few messages like this in the logs
>> >> >> >> >already:
>> >> >> >> >
>> >> >> >> >2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2 cs=1 l=0).fault initiating reconnect
>> >> >> >> >2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>> >> >> >> >2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
>> >> >> >> >2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to send and in the half accept state just closed
>> >> >> >> >
>> >> >> >> >What's weird about that is that this is a single-node cluster with
>> >> >> >> >ceph-mgr, ceph-mon, and the ceph-osd processes all running on the
>> >> >> >> >same host. So none of the communication should be leaving the node.
>> >> >> >> >
>> >> >> >> >Bryan
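
For anyone else wanting to apply the "debug mgr = 20" suggestion above, one
way to do it is sketched below; the daemon name is an assumption (it is
usually the mgr id, often the short hostname), so check /var/run/ceph/ for
the exact .asok file on your node:

  # Raise mgr verbosity at runtime via the admin socket (daemon name is a
  # guess; match it to the ceph-mgr.*.asok file under /var/run/ceph/):
  ceph daemon mgr.$(hostname -s) config set debug_mgr 20/20

  # Or set it persistently in /etc/ceph/ceph.conf and restart the daemon:
  #   [mgr]
  #       debug mgr = 20
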
>> >> >> >
>> >> >> > --
>> >> >> > - Rob
>> >> >>
>> >> >> --
>> >> >> Cheers,
>> >> >> Brad
>> >>
>> >> --
>> >> Cheers,
>> >> Brad
>

--
Cheers,
Brad
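
For completeness, the thread-mapping step Brad describes earlier in the
thread (matching hot thread IDs from ps against the LWP numbers in a gstack
dump) can be scripted roughly as follows; the dump path is arbitrary:

  # Show ceph-mgr threads sorted by CPU; the TID column corresponds to the
  # LWP numbers that gdb/gstack print for each thread.
  ps -L -o %cpu,stat,pid,tid,comm --sort=-%cpu -p "$(pidof ceph-mgr)" | head

  # Grab a full stack dump to compare the hot TIDs against.
  gstack "$(pidof ceph-mgr)" > /tmp/ceph-mgr.stacks.txt
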