Hi Nick,

I forgot to mention that I was also trying a workaround using the
userland client (rbd-fuse). The behaviour was exactly the same: it
worked fine for several hours while I was testing parallel reading and
writing, then IO wait and system load increased. This is why I don't
think it is an issue with the rbd kernel module.

Regards,
Christian

On 20.04.2015 at 11:37, Nick Fisk wrote:
> Hi Christian,
>
> A very non-technical answer, but as the problem seems related to the RBD
> client, it might be worth trying the latest kernel if possible. The RBD
> client is kernel based, so there may already be a fix that stops this
> from happening.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Christian Eichelmann
>> Sent: 20 April 2015 08:29
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: 100% IO Wait with CEPH RBD and RSYNC
>>
>> Hi Ceph-Users!
>>
>> We currently have a problem where I am not sure whether its cause lies
>> in Ceph or somewhere else. First, some information about our Ceph setup:
>>
>> * ceph version 0.87.1
>> * 5 MONs
>> * 12 OSD nodes with 60x 2 TB disks each
>> * 2 RSync gateways with 2x 10G Ethernet (kernel 3.16.3-2~bpo70+1,
>>   Debian Wheezy)
>>
>> Our cluster is mainly used to store log files from numerous servers via
>> rsync and to make them available via rsync as well. For about two weeks
>> we have been seeing very strange behaviour on our RSync gateways (they
>> just map several rbd devices and "export" them via rsyncd): the IO wait
>> on the systems increases until some of the cores get stuck at an IO
>> wait of 100%. Rsync processes become zombies (defunct) and/or cannot be
>> killed, even with SIGKILL. After the system has reached a load of about
>> 1400, it becomes totally unresponsive, and the only way to "fix" the
>> problem is to reboot the system.
>>
>> I was trying to reproduce the problem manually by simultaneously
>> reading and writing from several machines, but the problem didn't
>> appear.
>>
>> I have no idea where the error could be. I ran "ceph tell osd.* bench"
>> while the problem was occurring, and all OSDs showed normal benchmark
>> results. Does anyone have an idea how this can happen? If you need any
>> more information, please let me know.
>>
>> Regards,
>> Christian

--
Christian Eichelmann
System Administrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Phone: +49 721 91374-8026
christian.eichelmann@xxxxxxxx

Commercial register: Amtsgericht Montabaur / HRB 6484
Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss,
Jan Oetjen
Chairman of the Supervisory Board: Michael Scheeren
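
For reference, the two attachment paths compared above look roughly
like this. This is a sketch only: the pool name "rbd", the image name
"logs01", and the mount points are placeholders, not taken from the
report.

    # Kernel client: map the image to a block device via the rbd
    # kernel module, then mount it like any other disk.
    rbd map logs01 --pool rbd         # appears as e.g. /dev/rbd0
    mount /dev/rbd0 /mnt/logs01

    # Userland workaround: serve the pool's images through FUSE
    # instead, bypassing the rbd kernel module completely.
    mkdir -p /mnt/rbd-fuse
    rbd-fuse -p rbd /mnt/rbd-fuse     # each image shows up as a file

Since the hang reportedly occurred on both paths, whatever is wrong
sits in something the two share (the workload, the network, or the
cluster itself) rather than in the kernel client alone, which supports
the conclusion drawn above.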
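Nick's suggestion amounts to roughly the following on a gateway. This
assumes the wheezy-backports repository is already configured; the
3.16 kernel mentioned above is itself a backport, so newer revisions
would arrive through the same channel. Note that the RBD client code
ships inside the kernel image, so upgrading the kernel is what
updates it.

    # Show the running kernel and the rbd module it loads.
    uname -r
    modinfo rbd | egrep '^(filename|vermagic)'

    # Debian Wheezy: pull a newer kernel from backports (assumes
    # wheezy-backports is enabled in /etc/apt/sources.list).
    apt-get -t wheezy-backports install linux-image-amd64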
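When a gateway locks up as described (defunct rsync processes, load
around 1400), the following standard kernel facilities can show
whether IO requests are stuck in the client; none of these commands
are specific to this report:

    # Processes in uninterruptible sleep (state D) -- these are what
    # drive the load average up, and they ignore SIGKILL by design.
    ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'

    # Dump kernel stacks of all blocked tasks to the kernel log
    # (requires sysrq to be enabled via the kernel.sysrq sysctl).
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 100

    # Kernel rbd client only: in-flight requests to the OSDs. Entries
    # that never drain name the OSD the client is waiting on.
    cat /sys/kernel/debug/ceph/*/osdc

If osdc shows requests pinned against one OSD while "ceph tell osd.*
bench" still looks normal, a single slow disk or a flaky network path
is often a more likely culprit than the OSD daemons themselves.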
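Finally, a minimal sketch of the kind of parallel read/write load
described in the reproduction attempt; the paths, the process count,
and the use of rsync in both directions are made up for illustration:

    #!/bin/bash
    # Drive concurrent rsync readers and writers against a mounted rbd
    # filesystem, mimicking many log servers pushing and pulling at
    # once. DST and SRC are placeholders, not paths from the report.
    DST=/mnt/logs01          # mounted rbd (or rbd-fuse) filesystem
    SRC=/var/log             # any tree with plenty of small files

    for i in $(seq 1 16); do
        rsync -a "$SRC/" "$DST/writer-$i/" &            # writers
        rsync -a "$DST/writer-$i/" "/tmp/reader-$i/" &  # readers
    done
    wait

Left running long enough, this matches the pattern under which the
rbd-fuse mount reportedly locked up after several hours.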