Ah ok, good point. What FS are you using on the RBD?
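For reference, this can be checked directly on the gateway, assuming the
images are mapped through the kernel client (the device name below is an
example, not one taken from this thread):

    rbd showmapped        # lists mapped images and their /dev/rbdX devices
    lsblk -f /dev/rbd0    # the FSTYPE column shows the filesystem on the device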
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Eichelmann
> Sent: 20 April 2015 13:16
> To: Nick Fisk; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: 100% IO Wait with CEPH RBD and RSYNC
>
> Hi Nick,
>
> I forgot to mention that I also tried a workaround using the userland
> client (rbd-fuse). The behaviour was exactly the same: it worked fine for
> several hours while I tested parallel reading and writing, then IO wait
> and system load increased.
>
> This is why I don't think it is an issue with the rbd kernel module.
>
> Regards,
> Christian
>
> On 20.04.2015 at 11:37, Nick Fisk wrote:
> > Hi Christian,
> >
> > A very non-technical answer, but as the problem seems related to the
> > RBD client, it might be worth trying the latest kernel if possible. The
> > RBD client is kernel based, so there may be a fix that stops this from
> > happening.
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Christian Eichelmann
> >> Sent: 20 April 2015 08:29
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: 100% IO Wait with CEPH RBD and RSYNC
> >>
> >> Hi Ceph-Users!
> >>
> >> We currently have a problem where I am not sure whether its cause lies
> >> in Ceph or somewhere else. First, some information about our Ceph setup:
> >>
> >> * ceph version 0.87.1
> >> * 5 MONs
> >> * 12 OSD hosts with 60x2TB disks each
> >> * 2 RSYNC gateways with 2x10G Ethernet (kernel 3.16.3-2~bpo70+1,
> >>   Debian Wheezy)
> >>
> >> Our cluster is mainly used to store log files from numerous servers via
> >> rsync and to make them available via rsync as well. For about two weeks
> >> we have seen very strange behaviour on our RSYNC gateways (they just map
> >> several RBD devices and "export" them via rsyncd): the IO wait on the
> >> systems increases until some of the cores get stuck at an IO wait of
> >> 100%. RSync processes become zombies (defunct) and/or cannot be killed
> >> even with SIGKILL. After the system has reached a load of about 1400, it
> >> becomes totally unresponsive and the only way to "fix" the problem is to
> >> reboot the system.
> >>
> >> I tried to reproduce the problem manually by simultaneously reading and
> >> writing from several machines, but the problem didn't appear.
> >>
> >> I have no idea where the error can be. I ran a "ceph tell osd.* bench"
> >> during the problem and all OSDs showed normal benchmark results. Does
> >> anyone have an idea how this can happen? If you need any more
> >> information, please let me know.
> >>
> >> Regards,
> >> Christian
>
> --
> Christian Eichelmann
> System Administrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Phone: +49 721 91374-8026
> christian.eichelmann@xxxxxxxx
>
> Amtsgericht Montabaur / HRB 6484
> Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
> Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss,
> Jan Oetjen
> Chairman of the Supervisory Board: Michael Scheeren
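For anyone debugging the same symptoms: processes that survive SIGKILL are
typically stuck in uninterruptible sleep (D state) on I/O, and the kernel
can report where they are blocked. A minimal diagnostic sketch, assuming a
standard Linux system with sysrq enabled (nothing below is specific to this
cluster):

    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # list D-state tasks and their kernel wait channels
    echo w > /proc/sysrq-trigger                     # dump blocked-task stack traces to the kernel log
    dmesg | tail -n 60                               # look for "blocked for more than 120 seconds" traces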