Re: Idle OSD's keep using a lot of CPU

Erik Logtenberg <erik@xxxxxxxxxxxxx> · Fri, 02 Aug 2013 14:09:45 +0200

Hi Mark,

Yes, I do believe so. When I run ceph -w now, I see a healthy cluster,
but during the benchmark one of the osd's went "down". The osd daemon
process was never down, and eventually it was marked "up" again some
time after the benchmark finished. There was some rebuilding/checking
because some of the pg's were stale+active+rebuilding or something like
that, but in the end all pg's were active+clean again.
During all this, I do believe the monitor was working properly.

Still, the osd's on the second node all report "hunting for new mon"
every now and then. But I don't see any cause for this. Apart from the
few benchmarks I ran, there is no activity whatsoever.

Erik.

On 08/02/2013 01:34 PM, Mark Nelson wrote:
> Hi Erik,
> 
> Is your mon still running properly?
> 
> Mark
> 
> On 08/01/2013 05:06 PM, Erik Logtenberg wrote:
>> Hi,
>>
>> I think the high CPU usage was due to the system time not being right. I
>> activated ntp and it had to make quite a big adjustment, and after that
>> the high CPU usage was gone.
>>
>> Anyway, I immediately ran into another issue. I ran a simple benchmark:
>> # rados bench --pool benchmark 300 write --no-cleanup
>>
>> During the benchmark, one of my osd's went down. I checked the logs and
>> apparently there was no hardware failure (the disk is still nicely
>> mounted and the osd is still running), but the logfile fills up rapidly
>> with these messages:
>>
>> 2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
>> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874
>> cs=173547 l=0).fault, initiating reconnect
>> 2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
>> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875
>> cs=173549 l=0).fault, initiating reconnect
>> 2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >>
>> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876
>> cs=173551 l=0).fault, initiating reconnect
>>
>> What could be wrong here?
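One way to get a feel for how fast these reconnects happen is to tally them per peer. What follows is only a minimal sketch: the regular expression is inferred from the three lines quoted above rather than from any official Ceph log format, the sample entries are those same lines with the mail client's line wrapping undone, and the names FAULT_RE, parse_fault and summarize are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the three quoted log lines (an assumption, not an
# official Ceph log grammar): timestamp, "-- local >> peer pipe(...).fault".
FAULT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r".*? -- (?P<local>\S+) >> (?P<peer>\S+) pipe\(.*?\)\.fault"
)


def parse_fault(line):
    """Return (timestamp, local, peer) for a fault line, or None otherwise."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
    return ts, match.group("local"), match.group("peer")


def summarize(lines):
    """Count faults per peer and return the seconds between first and last."""
    faults = [f for f in map(parse_fault, lines) if f is not None]
    per_peer = Counter(peer for _, _, peer in faults)
    span = (faults[-1][0] - faults[0][0]).total_seconds() if faults else 0.0
    return per_peer, span


# The three entries from the mail, each joined back into a single line.
SAMPLE = [
    "2013-08-02 00:03:40.014982 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> "
    "192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874 "
    "cs=173547 l=0).fault, initiating reconnect",
    "2013-08-02 00:03:40.016682 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> "
    "192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875 "
    "cs=173549 l=0).fault, initiating reconnect",
    "2013-08-02 00:03:40.019241 7fe7336fd700  0 -- 192.168.1.15:6801/1229 >> "
    "192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876 "
    "cs=173551 l=0).fault, initiating reconnect",
]

if __name__ == "__main__":
    per_peer, span = summarize(SAMPLE)
    for peer, count in per_peer.items():
        print(f"{peer}: {count} faults in {span:.6f} s")
```

On the three quoted lines this reports three faults towards 192.168.1.16:6801/3001 within roughly 4 ms, which suggests the reconnect attempts are happening back to back. Feeding it the whole logfile instead of SAMPLE would show whether only that one peer is involved.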
>>
>> Kind regards,
>>
>> Erik.
>>
>>
>>
>> On 08/01/2013 08:00 AM, Dan Mick wrote:
>>> Logging might well help.
>>>
>>> http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/
>>>
>>>
>>>
>>> On 07/31/2013 03:51 PM, Erik Logtenberg wrote:
>>>> Hi,
>>>>
>>>> I just added a second node to my ceph test platform. The first node has
>>>> a mon and three osd's, the second node only has three osd's. Adding the
>>>> osd's was pretty painless, and ceph distributed the data from the first
>>>> node evenly over both nodes so everything seems to be fine. The monitor
>>>> also thinks everything is fine:
>>>>
>>>> 2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292
>>>> active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail
>>>>
>>>> Unfortunately, the three osd's on the second node keep eating a lot of
>>>> cpu, while there is no activity whatsoever:
>>>>
>>>>     PID USER      VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>>>> 21272 root      441440  34632   7848 S  61.8  0.9   4:08.62 ceph-osd
>>>> 21145 root      439852  29316   8360 S  60.4  0.7   4:04.31 ceph-osd
>>>> 21036 root      443828  31324   8336 S  60.1  0.8   4:07.55 ceph-osd
>>>>
>>>> Any idea why that is and how I can even ask an osd what it's doing?
>>>> There is no corresponding hdd activity, it's only cpu and hardly any
>>>> memory usage.
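On asking an osd what it is doing: each daemon exposes an admin socket that can be queried locally with "ceph --admin-daemon". A rough sketch of wrapping that from Python follows. It assumes the default socket location /var/run/ceph/ceph-osd.N.asok (the "admin socket" setting in ceph.conf may point elsewhere), the osd id 3 in the usage note is just a placeholder, and admin_socket_cmd and query_osd are made-up helper names.

```python
import subprocess

# Assumed default admin socket path; adjust if ceph.conf overrides it.
SOCKET_TEMPLATE = "/var/run/ceph/ceph-osd.{osd_id}.asok"


def admin_socket_cmd(osd_id, *command):
    """Build the argv that queries one OSD through its admin socket."""
    socket_path = SOCKET_TEMPLATE.format(osd_id=osd_id)
    return ["ceph", "--admin-daemon", socket_path, *command]


def query_osd(osd_id, *command):
    """Run the query on the OSD's own host and return the raw output."""
    result = subprocess.run(
        admin_socket_cmd(osd_id, *command),
        capture_output=True,
        text=True,
        check=True,  # raise if the ceph command exits with an error
    )
    return result.stdout
```

For example, query_osd(3, "help") should list the commands the running version understands, and query_osd(3, "perf", "dump") or query_osd(3, "dump_ops_in_flight") are the usual starting points for seeing what a busy daemon is up to. This has to run on the node that hosts the osd, since the socket is local.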
>>>>
>>>> Also the monitor on the first node is doing the same thing:
>>>>
>>>>     PID USER    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>> 12825 root    186900  23492   5540 S 141.1 0.590   9:47.64 ceph-mon
>>>>
>>>> I tried stopping the three osd's: that makes the monitor calm down, but
>>>> after restarting the osd's, the monitor resumes its cpu usage. I also
>>>> tried stopping the monitor, which makes the three osd's calm down, but
>>>> once again they will start eating cpu again as soon as the monitor is
>>>> back online.
>>>>
>>>> In the meantime, the first three osd's, the ones on the same machine
>>>> as the monitor, don't behave like this at all. Currently as there is
>>>> no activity, they are just idling on low cpu usage, as expected.
>>>>
>>>> Kind regards,
>>>>
>>>> Erik.
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com