Ah ok, good point. What FS are you using on the RBD?
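For reference, this can be checked directly on the gateway, assuming the
images are mapped through the kernel client (the device name below is an
example, not one taken from this thread):

    rbd showmapped        # lists mapped images and their /dev/rbdX devices
    lsblk -f /dev/rbd0    # the FSTYPE column shows the filesystem on the device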
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Eichelmann
> Sent: 20 April 2015 13:16
> To: Nick Fisk; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: 100% IO Wait with CEPH RBD and RSYNC
>
> Hi Nick,
>
> I forgot to mention that I also tried a workaround using the userland
> client (rbd-fuse). The behaviour was exactly the same: it worked fine for
> several hours while I tested parallel reading and writing, then IO wait
> and system load increased.
>
> This is why I don't think it is an issue with the rbd kernel module.
>
> Regards,
> Christian
>
> On 20.04.2015 at 11:37, Nick Fisk wrote:
> > Hi Christian,
> >
> > A very non-technical answer, but as the problem seems related to the
> > RBD client, it might be worth trying the latest kernel if possible. The
> > RBD client is kernel based, so there may be a fix that stops this from
> > happening.
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >> Of Christian Eichelmann
> >> Sent: 20 April 2015 08:29
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Subject: 100% IO Wait with CEPH RBD and RSYNC
> >>
> >> Hi Ceph-Users!
> >>
> >> We currently have a problem where I am not sure whether its cause lies
> >> in Ceph or somewhere else. First, some information about our Ceph setup:
> >>
> >> * ceph version 0.87.1
> >> * 5 MONs
> >> * 12 OSD hosts with 60x2TB disks each
> >> * 2 RSYNC gateways with 2x10G Ethernet (kernel 3.16.3-2~bpo70+1,
> >>   Debian Wheezy)
> >>
> >> Our cluster is mainly used to store log files from numerous servers via
> >> rsync and to make them available via rsync as well. For about two weeks
> >> we have seen very strange behaviour on our RSYNC gateways (they just map
> >> several RBD devices and "export" them via rsyncd): the IO wait on the
> >> systems increases until some of the cores get stuck at an IO wait of
> >> 100%. RSync processes become zombies (defunct) and/or cannot be killed
> >> even with SIGKILL. After the system has reached a load of about 1400, it
> >> becomes totally unresponsive and the only way to "fix" the problem is to
> >> reboot the system.
> >>
> >> I tried to reproduce the problem manually by simultaneously reading and
> >> writing from several machines, but the problem didn't appear.
> >>
> >> I have no idea where the error can be. I ran a "ceph tell osd.* bench"
> >> during the problem and all OSDs showed normal benchmark results. Does
> >> anyone have an idea how this can happen? If you need any more
> >> information, please let me know.
> >>
> >> Regards,
> >> Christian
>
> --
> Christian Eichelmann
> System Administrator
>
> 1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
> Brauerstraße 48 · DE-76135 Karlsruhe
> Phone: +49 721 91374-8026
> christian.eichelmann@xxxxxxxx
>
> Amtsgericht Montabaur / HRB 6484
> Executive Board: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich,
> Robert Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss,
> Jan Oetjen
> Chairman of the Supervisory Board: Michael Scheeren
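For anyone debugging the same symptoms: processes that survive SIGKILL are
typically stuck in uninterruptible sleep (D state) on I/O, and the kernel
can report where they are blocked. A minimal diagnostic sketch, assuming a
standard Linux system with sysrq enabled (nothing below is specific to this
cluster):

    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # list D-state tasks and their kernel wait channels
    echo w > /proc/sysrq-trigger                     # dump blocked-task stack traces to the kernel log
    dmesg | tail -n 60                               # look for "blocked for more than 120 seconds" traces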