Hi Marcel,

FileStore doesn't subscribe to any such event from the device. At present it relies on the filesystem to return an error during IO (this is the FileStore assert), and it asserts based on that error. The FileJournal assert you are getting in the aio path relies on the Linux aio subsystem to report an error. If IO is in flight, you should hit these asserts pretty quickly, not after a couple of minutes. Are you saying the crash timestamp is a couple of minutes after the device disappeared?

BTW, if you are on Ubuntu, upstart will restart the OSDs after a crash, and based on some logic (too-frequent crashes) it will eventually decide not to. So try to find the very first crash trace in the log and see when it occurred.

Also, I hope you are aware that recovery will not kick off until a (configurable) grace period is over.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Marcel Lauhoff
Sent: Tuesday, May 17, 2016 5:59 AM
To: ceph-users
Subject: OSD process doesn't die immediately after device disappears

Hi,

we recently played the good ol' pull-a-harddrive game and wondered why the OSD processes took a couple of minutes to recognize their misfortune.

In our configuration two OSDs share an HDD: OSD n uses it as its journal device, OSD n+1 as its filesystem.

We expected that OSDs detect this kind of failure and immediately shut down, so that transactions aren't blocked and recovery can start as soon as possible.

What do you think? I read through the FileStore code about a year ago and can't remember any code that somehow subscribes to events of the underlying devices.

Does anyone use external watchdog tools for this type of failure?

~irq0

The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

--
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
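
A side note on the journal log line quoted above ("journal aio to 0~4096 wrote 18446744073709551611"): the value after "wrote" looks like a negative return code printed as an unsigned 64-bit integer. That reading is an assumption based only on the number itself, not something confirmed from the FileJournal source. Under that assumption, a minimal Python sketch to decode it follows; `decode_aio_result` is a hypothetical helper name used only for illustration.

```python
import errno
from typing import Optional, Tuple


def decode_aio_result(raw: int) -> Tuple[int, Optional[str]]:
    """Reinterpret an unsigned 64-bit value as a signed one.

    Assumption: the logged number is a signed 64-bit aio result that was
    printed as unsigned. If the signed value is negative, it is treated
    as -errno and the symbolic errno name is looked up.
    """
    # Two's complement: values >= 2**63 represent negative numbers.
    signed = raw - (1 << 64) if raw >= (1 << 63) else raw
    # errno.errorcode maps errno numbers to names such as 'EIO'.
    name = errno.errorcode.get(-signed) if signed < 0 else None
    return signed, name


if __name__ == "__main__":
    # Value taken from the journal log line in the original message.
    print(decode_aio_result(18446744073709551611))  # prints (-5, 'EIO')
```

Read this way, the aio write came back with -5, i.e. EIO, which would be consistent with the explanation in the reply that the journal relies on the Linux aio subsystem to report the error.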