Hi,

we recently played the good ol' "pull a hard drive" game and wondered why the OSD processes took a couple of minutes to notice their misfortune.

In our configuration two OSDs share an HDD: OSD n uses it as its journal device, OSD n+1 uses it for its filesystem.

We expected the OSDs to detect this kind of failure and shut down immediately, so that transactions aren't blocked and recovery can start as soon as possible.

What do you think? I read through the FileStore code about a year ago and can't remember any code that subscribes to events from the underlying devices.

Does anyone use external watchdog tools for this type of failure?

~irq0

The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

-- 
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx
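
A side note on the first log line, offered as a sketch rather than anything checked against FileJournal.cc: the value 18446744073709551611 after "wrote" looks like a negative return code printed as an unsigned 64-bit integer. Assuming that is what happened, reinterpreting it as signed gives the errno. The helper name decode_aio_result below is made up for illustration and is not part of Ceph.

```python
import errno
import os


def decode_aio_result(raw: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer.

    Hypothetical helper for reading the log line; not a Ceph function.
    """
    if raw >= 1 << 63:
        return raw - (1 << 64)
    return raw


raw = 18446744073709551611  # value copied from the journal log line
signed = decode_aio_result(raw)
print(signed)  # -5

if signed < 0:
    # Negative results are conventionally -errno; on Linux errno 5 is EIO.
    print(errno.errorcode.get(-signed), os.strerror(-signed))
```

If that reading is right, the journal write came back with EIO, which fits a pulled disk: the error only surfaces once an I/O against the missing device actually completes with a failure.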