Re: Idle OSD's keep using a lot of CPU

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Fri, 02 Aug 2013 06:34:13 -0500

Hi Erik,

Is your mon still running properly?

Mark

On 08/01/2013 05:06 PM, Erik Logtenberg wrote:
Hi,

I think the high CPU usage was due to the system time not being right. I
activated ntp and it had to make quite a big adjustment, and after that
the high CPU usage was gone.
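
If clock skew is the suspect, a quick sanity check is to compare each node's clock offset against the drift the monitors tolerate. The snippet below is only a sketch: the 0.05 s threshold is an assumption based on what I understand to be the default of Ceph's "mon clock drift allowed" option, and the host names and offsets are made up for illustration (for example, values read off `ntpq -p` on each node and converted to seconds).

```python
# Assumption: 0.05 s mirrors what I understand to be the default of Ceph's
# "mon clock drift allowed" option; check your own configuration.
ALLOWED_DRIFT_S = 0.05


def skewed_hosts(offsets_s, allowed=ALLOWED_DRIFT_S):
    """Return the hosts whose signed clock offset (seconds) exceeds `allowed`."""
    return sorted(host for host, off in offsets_s.items() if abs(off) > allowed)


# Hypothetical offsets in seconds, not taken from this thread.
example = {"node1": 0.002, "node2": -3.7}
print(skewed_hosts(example))
```

Any host that shows up in the result is worth fixing with ntp before digging further into the daemons themselves.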

Anyway, I immediately ran into another issue. I ran a simple benchmark:
# rados bench --pool benchmark 300 write --no-cleanup

During the benchmark, one of my osd's went down. I checked the logs and
apparently there was no hardware failure (the disk is still nicely
mounted and the osd is still running), but the logfile fills up rapidly
with these messages:

2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874
cs=173547 l=0).fault, initiating reconnect
2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875
cs=173549 l=0).fault, initiating reconnect
2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876
cs=173551 l=0).fault, initiating reconnect

What could be wrong here?
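
To get a feel for how tight the reconnect loop is, the fault entries can be counted per peer and timed. This is a rough sketch, not an official Ceph tool: the regular expression is written against the three entries quoted above only, and `summarize_faults` is a name I made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches one messenger fault entry; \s+ tolerates the mail client's wrapping.
FAULT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\s+\S+\s+\d+\s+--\s+"
    r"(\S+)\s+>>\s+(\S+)\s+pipe\(.*?\)\.fault, initiating reconnect",
    re.DOTALL,
)

# The three entries quoted in this message, wrapped as they appear here.
SAMPLE = """\
2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874
cs=173547 l=0).fault, initiating reconnect
2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875
cs=173549 l=0).fault, initiating reconnect
2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876
cs=173551 l=0).fault, initiating reconnect
"""


def summarize_faults(log_text):
    """Return (faults per (local, peer) pair, seconds from first to last fault)."""
    pairs = Counter()
    stamps = []
    for ts, local, peer in FAULT_RE.findall(log_text):
        pairs[(local, peer)] += 1
        stamps.append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f"))
    span = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return pairs, span


pairs, span = summarize_faults(SAMPLE)
for (local, peer), count in pairs.items():
    print(f"{local} -> {peer}: {count} faults in {span:.6f} s")
```

On the excerpt above this reports three faults against the same peer within a few milliseconds, so the connection is failing and being retried almost immediately rather than timing out slowly.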

Kind regards,

Erik.

On 08/01/2013 08:00 AM, Dan Mick wrote:
Logging might well help.

http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/

On 07/31/2013 03:51 PM, Erik Logtenberg wrote:
Hi,

I just added a second node to my ceph test platform. The first node has
a mon and three osd's, the second node only has three osd's. Adding the
osd's was pretty painless, and ceph distributed the data from the first
node evenly over both nodes so everything seems to be fine. The monitor
also thinks everything is fine:

2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292
active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail

Unfortunately, the three osd's on the second node keep eating a lot of
cpu, while there is no activity whatsoever:

    PID USER      VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21272 root      441440  34632   7848 S  61.8  0.9   4:08.62 ceph-osd
21145 root      439852  29316   8360 S  60.4  0.7   4:04.31 ceph-osd
21036 root      443828  31324   8336 S  60.1  0.8   4:07.55 ceph-osd

Any idea why that is and how I can even ask an osd what it's doing?
There is no corresponding hdd activity, it's only cpu and hardly any
memory usage.

Also the monitor on the first node is doing the same thing:

    PID USER    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
12825 root    186900  23492   5540 S 141.1 0.590   9:47.64 ceph-mon

I tried stopping the three osd's: that makes the monitor calm down, but
after restarting the osd's, the monitor resumes its cpu usage. I also
tried stopping the monitor, which makes the three osd's calm down, but
they start eating cpu again as soon as the monitor is back online.

In the meantime, the first three osd's, the ones on the same machine as
the monitor, don't behave like this at all. Currently, as there is no
activity, they are just idling on low cpu usage, as expected.

Kind regards,

Erik.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com