Hi,

I see an issue with systemd's restart behaviour and disk I/O errors.

If a disk fails with I/O errors, ceph-osd stops running. systemd detects this and starts the daemon again. In our cluster I have seen loops of OSD crashes caused by disk failure and restarts triggered by systemd, every time with peering impact and timeouts in our application, until systemd gave up.

Obviously Ceph needs the restart feature (at least with dmcrypt) to avoid race conditions in the startup process. But in the case of disk-related failures this is counterproductive.

What do you think about this? Is this a bug which should be fixed?

We use Ceph Jewel (10.2.9).

Regards,
Manuel

--
Manuel Lausch
Systemadministrator
Cloud Services

1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-1847
E-Mail: manuel.lausch@xxxxxxxx | Web: www.1und1.de

Amtsgericht Montabaur, HRB 5452
Geschäftsführer: Thomas Ludwig, Jan Oetjen
Member of United Internet

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient of this e-mail, you are hereby notified that saving, distribution or use of the content of this e-mail in any way is prohibited. If you have received this e-mail in error, please notify the sender and delete the e-mail.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
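
P.S.: As a local workaround, the restart loop described above can at least be bounded with a systemd drop-in for the OSD unit. The values below are illustrative assumptions, not the defaults shipped with the Jewel packages:

```ini
# /etc/systemd/system/ceph-osd@.service.d/restart-limit.conf
# Hypothetical drop-in: bound how often a crashing OSD is restarted.
# Values are examples only, tune to your environment.
[Service]
Restart=on-failure
RestartSec=20s
# On systemd < 230 (as in Jewel-era distros) the rate-limit settings
# live in [Service]; on newer systemd they are StartLimitIntervalSec=
# and StartLimitBurst= in the [Unit] section instead.
StartLimitInterval=30min
StartLimitBurst=3
```

After creating the drop-in, run `systemctl daemon-reload` so systemd picks it up. With settings like these, an OSD that keeps dying on a failed disk is abandoned after a few attempts instead of flapping indefinitely, while the restart feature still covers transient startup races.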