Hi Ceph-Users!

We currently have a problem where I am not sure whether its cause lies in Ceph or somewhere else. First, some information about our Ceph setup:

* ceph version 0.87.1
* 5 MONs
* 12 OSD hosts with 60x 2 TB disks each
* 2 rsync gateways with 2x 10G Ethernet (kernel 3.16.3-2~bpo70+1, Debian Wheezy)

Our cluster is mainly used to store log files from numerous servers via rsync and to make them available via rsync as well.

For about two weeks we have been seeing very strange behaviour on our rsync gateways (they just map several rbd devices and "export" them via rsyncd): the I/O wait on the systems keeps increasing until some of the cores get stuck at 100% I/O wait. Rsync processes become zombies (defunct) and/or cannot be killed, even with SIGKILL. Once the system has reached a load of about 1400, it becomes totally unresponsive and the only way to "fix" the problem is to reboot.

I tried to reproduce the problem manually by simultaneously reading from and writing to several machines, but it did not appear. I have no idea where the error could be. I ran "ceph tell osd.* bench" while the problem was occurring, and all OSDs showed normal benchmark results.

Does anyone have an idea how this can happen? If you need any more information, please let me know.

Regards,
Christian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
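P.S. For anyone suggesting diagnostics: the symptoms above (unkillable rsync processes, 100% I/O wait) point to tasks stuck in uninterruptible sleep (D state), and this is roughly how we would capture their kernel stacks next time it happens. This is a hedged sketch only; it assumes root on the gateway and a kernel with SysRq enabled, and needs a live system to run.

```shell
# Sketch, not our actual tooling: list processes stuck in uninterruptible
# sleep (state D) together with the kernel function they are waiting in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'

# Dump the kernel stack of one stuck process (PID is a placeholder):
cat /proc/12345/stack

# Ask the kernel to log all blocked tasks to dmesg via SysRq-w
# (requires SysRq to be enabled, e.g. kernel.sysrq=1):
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
```

The SysRq-w output in dmesg should show whether the tasks are blocked in the rbd/libceph code paths or elsewhere (filesystem, block layer), which would help narrow down where the hang originates.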