OSD process doesn't die immediately after device disappears

Marcel Lauhoff <lauhoff@xxxxxxxxxxxx> · Tue, 17 May 2016 14:59:11 +0200

Hi,

we recently played the good ol' "pull a hard drive" game and wondered why
the OSD processes took a couple of minutes to recognize their misfortune.

In our configuration two OSDs share an HDD:
  OSD n uses it as its journal device,
  OSD n+1 uses it for its filesystem.

We expected the OSDs to detect this kind of failure and shut down
immediately, so that transactions aren't blocked and recovery can start as
soon as possible.

What do you think?

I read through the FileStore code about a year ago and can't remember
any code that subscribes to events from the underlying devices.

Does anyone use external watchdog tools for this type of failure?
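
To make the question concrete, here is a minimal sketch of what such an
external watchdog check could look like. It assumes that the device node
(for example under /dev) disappears or stops being a block device when the
drive is pulled, which depends on the kernel and udev setup. The function
name and the idea of polling are illustrative assumptions, not something
Ceph provides.

```python
import os
import stat


def missing_block_devices(paths):
    """Return the paths that no longer resolve to a block device.

    Hypothetical helper for an external watchdog; not part of Ceph.
    """
    missing = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # The device node is gone or cannot be inspected: report it.
            missing.append(path)
            continue
        if not stat.S_ISBLK(mode):
            # The path exists but is not a block device.
            missing.append(path)
    return missing
```

A watchdog could call this periodically with the journal and data devices
of each OSD and stop the affected daemons when the list is non-empty. How
the daemons are stopped is deployment-specific and left out here.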

~irq0

The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")
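
The "wrote 18446744073709551611" value in the journal message looks odd at
first. A plausible reading, inferred from the number itself rather than
from the FileJournal source, is that a negative aio result was printed as
an unsigned 64-bit integer. Under that assumption it decodes to -5, which
is EIO on Linux. The helper name below is made up for illustration.

```python
import errno
import os


def decode_aio_result(raw):
    """Reinterpret an unsigned 64-bit value as signed; negative means -errno."""
    # Values with the top bit set are negative in two's complement.
    signed = raw - (1 << 64) if raw >= (1 << 63) else raw
    if signed < 0:
        code = -signed
        # Map the errno number to its symbolic name and message.
        return signed, errno.errorcode.get(code, "unknown"), os.strerror(code)
    return signed, None, None


# The value from the log line above.
print(decode_aio_result(18446744073709551611))
```

So the journal OSD apparently did see an I/O error from the missing
device; the open question is why it took minutes to get there.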

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

--
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com