Re: Idle OSD's keep using a lot of CPU

Erik Logtenberg <erik@xxxxxxxxxxxxx> · Fri, 02 Aug 2013 00:06:55 +0200

Hi,

I think the high CPU usage was due to the system time not being right. I
activated ntp and it had to make quite a big adjustment; after that the
high CPU usage was gone.

Anyway, I immediately ran into another issue. I ran a simple benchmark:
# rados bench --pool benchmark 300 write --no-cleanup

During the benchmark, one of my osd's went down. I checked the logs and
apparently there was no hardware failure (the disk is still nicely
mounted and the osd is still running), but the logfile fills up rapidly
with these messages:

2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874
cs=173547 l=0).fault, initiating reconnect
2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875
cs=173549 l=0).fault, initiating reconnect
2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876
cs=173551 l=0).fault, initiating reconnect
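
To get a feel for how quickly these faults pile up and which peer they
involve, a minimal sketch along the following lines could tally them per
connection. It assumes each entry sits on a single line, as it would in
the osd's own logfile (the lines above were wrapped by the mail client).
The regular expression and the name count_faults are purely illustrative
and not part of any ceph tooling.

```python
import re
from collections import Counter

# Illustrative pattern for the fault entries quoted above; it assumes one
# entry per line and is not derived from any official log format spec.
FAULT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+\d+ -- "
    r"(?P<local>\S+) >> (?P<peer>\S+) pipe\(.*?\bcs=(?P<cs>\d+)\b.*?\)"
    r"\.fault, initiating reconnect"
)


def count_faults(lines):
    """Return a Counter mapping (local, peer) to the number of fault entries."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.match(line.strip())
        if match:
            counts[(match.group("local"), match.group("peer"))] += 1
    return counts


# The three entries from this mail, re-joined onto single lines.
SAMPLE = [
    "2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> "
    "192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874 "
    "cs=173547 l=0).fault, initiating reconnect",
    "2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> "
    "192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875 "
    "cs=173549 l=0).fault, initiating reconnect",
    "2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> "
    "192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876 "
    "cs=173551 l=0).fault, initiating reconnect",
]

if __name__ == "__main__":
    for (local, peer), n in count_faults(SAMPLE).items():
        print(f"{local} >> {peer}: {n} faults")
```

On the three entries above this reports a single connection,
192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001, with 3 faults, all
within about 5 ms of each other.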

What could be wrong here?

Kind regards,

Erik.

On 08/01/2013 08:00 AM, Dan Mick wrote:
> Logging might well help.
> 
> http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
> 
> 
> 
> On 07/31/2013 03:51 PM, Erik Logtenberg wrote:
>> Hi,
>>
>> I just added a second node to my ceph test platform. The first node has
>> a mon and three osd's, the second node only has three osd's. Adding the
>> osd's was pretty painless, and ceph distributed the data from the first
>> node evenly over both nodes so everything seems to be fine. The monitor
>> also thinks everything is fine:
>>
>> 2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292
>> active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail
>>
>> Unfortunately, the three osd's on the second node keep eating a lot of
>> cpu, while there is no activity whatsoever:
>>
>>    PID USER      VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> 21272 root      441440  34632   7848 S  61.8  0.9   4:08.62 ceph-osd
>> 21145 root      439852  29316   8360 S  60.4  0.7   4:04.31 ceph-osd
>> 21036 root      443828  31324   8336 S  60.1  0.8   4:07.55 ceph-osd
>>
>> Any idea why that is and how I can even ask an osd what it's doing?
>> There is no corresponding hdd activity, it's only cpu and hardly any
>> memory usage.
>>
>> Also the monitor on the first node is doing the same thing:
>>
>>    PID USER    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>> 12825 root    186900  23492   5540 S 141.1 0.590   9:47.64 ceph-mon
>>
>> I tried stopping the three osd's: that makes the monitor calm down, but
>> after restarting the osd's, the monitor resumes its cpu usage. I also
>> tried stopping the monitor, which makes the three osd's calm down, but
>> once again they will start eating cpu again as soon as the monitor is
>> back online.
>>
>> In the mean time, the first three osd's, the ones on the same machine as
>> the monitor, don't behave like this at all. Currently as there is no
>> activity, they are just idling on low cpu usage, as expected.
>>
>> Kind regards,
>>
>> Erik.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>