Hi Ceph-Users!

We currently have a problem where I am not sure whether its cause lies in Ceph or somewhere else. First, some information about our Ceph setup:

* ceph version 0.87.1
* 5 MONs
* 12 OSD hosts with 60x 2 TB disks each
* 2 rsync gateways with 2x 10G Ethernet (kernel 3.16.3-2~bpo70+1, Debian Wheezy)

Our cluster is mainly used to store log files from numerous servers via rsync and to make them available via rsync as well.

For about two weeks we have been seeing very strange behaviour on our rsync gateways (they just map several rbd devices and "export" them via rsyncd): the I/O wait on the systems keeps increasing until some of the cores get stuck at 100% I/O wait. Rsync processes become zombies (defunct) and/or cannot be killed, even with SIGKILL. Once the system has reached a load of about 1400, it becomes totally unresponsive and the only way to "fix" the problem is to reboot.

I tried to reproduce the problem manually by simultaneously reading from and writing to several machines, but it did not appear. I have no idea where the error could be. I ran "ceph tell osd.* bench" while the problem was occurring, and all OSDs showed normal benchmark results.

Does anyone have an idea how this can happen? If you need any more information, please let me know.

Regards,
Christian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
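P.S. For anyone suggesting diagnostics: the symptoms above (unkillable rsync processes, 100% I/O wait) point to tasks stuck in uninterruptible sleep (D state), and this is roughly how we would capture their kernel stacks next time it happens. This is a hedged sketch only; it assumes root on the gateway and a kernel with SysRq enabled, and needs a live system to run.

```shell
# Sketch, not our actual tooling: list processes stuck in uninterruptible
# sleep (state D) together with the kernel function they are waiting in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'

# Dump the kernel stack of one stuck process (PID is a placeholder):
cat /proc/12345/stack

# Ask the kernel to log all blocked tasks to dmesg via SysRq-w
# (requires SysRq to be enabled, e.g. kernel.sysrq=1):
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
```

The SysRq-w output in dmesg should show whether the tasks are blocked in the rbd/libceph code paths or elsewhere (filesystem, block layer), which would help narrow down where the hang originates.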