Re: 100% IO Wait with CEPH RBD and RSYNC

Hi Christian,

I've never debugged the kernel client either, so I don't know how to
increase the debugging. (I don't see any useful parameters on the
kernel modules.)
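
The check itself is trivial; a minimal sketch of it in Python, assuming
the modules involved are rbd, libceph and ceph, just reads whatever is
under /sys/module/<name>/parameters:

import os

# Print any parameters the rbd/libceph/ceph kernel modules expose.
for mod in ("rbd", "libceph", "ceph"):
    pdir = "/sys/module/%s/parameters" % mod
    if not os.path.isdir(pdir):
        print("%s: no parameters directory" % mod)
        continue
    for name in sorted(os.listdir(pdir)):
        try:
            value = open(os.path.join(pdir, name)).read().strip()
        except IOError:
            value = "<not readable>"
        print("%s.%s = %s" % (mod, name, value))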

Your log looks like the client just stops communicating with the Ceph
cluster. Is iptables getting in the way?
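
To rule that out quickly, something like the sketch below can be run on
the gateway while it hangs; it only checks that TCP connections to the
monitors still open (MON_HOSTS is a placeholder, substitute your real
monitor addresses; 6789 is the default mon port):

import socket

# Placeholder monitor addresses -- replace with the mons from ceph.conf.
MON_HOSTS = ["mon1.example.com", "mon2.example.com", "mon3.example.com"]
MON_PORT = 6789  # default Ceph monitor port

for host in MON_HOSTS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, MON_PORT))
        print("%s:%d reachable" % (host, MON_PORT))
    except socket.error as err:
        print("%s:%d NOT reachable: %s" % (host, MON_PORT, err))
    finally:
        s.close()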

Cheers, Dan

On Tue, Apr 21, 2015 at 9:13 AM, Christian Eichelmann
<christian.eichelmann@xxxxxxxx> wrote:
> Hi Dan,
>
> we are already back on the kernel module, since the same problems were
> happening with fuse. I had no special ulimit settings for the fuse
> process, so that could have been an issue there.
>
> I have pasted the kernel messages from such incidents here:
> http://pastebin.com/X5JRe1v3
>
> I have never debugged the kernel client. Can you give me a short hint
> on how to increase the debug level and where the logs are written?
>
> Regards,
> Christian
>
> On 20.04.2015 at 15:50, Dan van der Ster wrote:
>> Hi,
>> This is similar to what you would observe if you hit the ulimit on
>> open files/sockets in a Ceph client, though that normally only affects
>> user-mode clients, not the kernel. What are the ulimits of your
>> rbd-fuse client? Also, you could increase the client debug logging
>> levels to see why the client is hanging. When the kernel rbd client
>> was hanging, was there anything printed to dmesg?
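>>
>> For the ulimit check, a rough sketch along these lines (assuming you
>> know the PID of the rbd-fuse process; it only reads the standard
>> /proc/<pid>/limits and /proc/<pid>/fd entries) shows the open-files
>> limit next to the number of descriptors actually in use:
>>
>> import os, sys
>>
>> pid = sys.argv[1]  # PID of the rbd-fuse process, passed on the command line
>>
>> # The configured limits, including "Max open files"
>> with open("/proc/%s/limits" % pid) as f:
>>     for line in f:
>>         if "open files" in line:
>>             print(line.strip())
>>
>> # Number of file descriptors the process currently has open
>> print("open fds: %d" % len(os.listdir("/proc/%s/fd" % pid)))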
>> Cheers, Dan
>>
>> On Mon, Apr 20, 2015 at 9:29 AM, Christian Eichelmann
>> <christian.eichelmann@xxxxxxxx> wrote:
>>> Hi Ceph-Users!
>>>
>>> We currently have a problem where I am not sure whether its cause
>>> lies in Ceph or somewhere else. First, some information about our
>>> Ceph setup:
>>>
>>> * ceph version 0.87.1
>>> * 5 MON
>>> * 12 OSD with 60x2TB each
>>> * 2 RSYNC Gateways with 2x10G Ethernet (Kernel: 3.16.3-2~bpo70+1, Debian
>>> Wheezy)
>>>
>>> Our cluster is mainly used to store log files from numerous servers
>>> via rsync and to make them available via rsync as well. For about two
>>> weeks we have been seeing very strange behaviour on our rsync gateways
>>> (they just map several rbd devices and "export" them via rsyncd): the
>>> IO wait on the systems increases until some of the cores get stuck at
>>> 100% IO wait. Rsync processes become zombies (defunct) and/or cannot
>>> be killed, even with SIGKILL. After the system has reached a load of
>>> about 1400, it becomes totally unresponsive and the only way to "fix"
>>> the problem is to reboot the system.
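>>>
>>> When it happens, a minimal sketch like the following (it walks /proc,
>>> picks out tasks in uninterruptible sleep, i.e. state "D", and prints
>>> their kernel stacks; it needs root and /proc/<pid>/stack support)
>>> shows where those unkillable processes are actually stuck:
>>>
>>> import os
>>>
>>> for pid in filter(str.isdigit, os.listdir("/proc")):
>>>     try:
>>>         # Third field of /proc/<pid>/stat is the state; this simple
>>>         # split assumes the command name contains no spaces.
>>>         fields = open("/proc/%s/stat" % pid).read().split()
>>>         if fields[2] != "D":
>>>             continue
>>>         stack = open("/proc/%s/stack" % pid).read()
>>>         print("PID %s %s in D state:\n%s" % (pid, fields[1], stack))
>>>     except (IOError, OSError):
>>>         pass  # process exited or entry not readable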
>>>
>>> I tried to reproduce the problem manually by simultaneously reading
>>> and writing from several machines, but the problem did not appear.
>>>
>>> I have no idea where the error could be. I ran a "ceph tell osd.*
>>> bench" during the problem and all OSDs showed normal benchmark
>>> results. Does anyone have an idea how this can happen? If you need
>>> any more information, please let me know.
>>>
>>> Regards,
>>> Christian
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Christian Eichelmann
> Systemadministrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Telefon: +49 721 91374-8026
> christian.eichelmann@xxxxxxxx
>
> Amtsgericht Montabaur / HRB 6484
> Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
> Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
> Aufsichtsratsvorsitzender: Michael Scheeren
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




