Re: OSD process doesn't die immediately after device disappears

Hi Marcel,
FileStore doesn't subscribe to any such event from the device. At present it relies on the filesystem to return an error during I/O, and it asserts on that error (this is the FileStore assert you are seeing).
The FileJournal assert you are getting in the aio path relies on the Linux aio subsystem to report an error.
If I/O is in progress, both asserts should fire fairly quickly, not a couple of minutes later.
Are you saying the crash timestamp is a couple of minutes after the drive was pulled?
BTW, if you are on Ubuntu, upstart will restart the OSDs after a crash and, based on some logic (crashes that are too frequent), will eventually decide not to. So look in the log for the very first crash trace and check when it occurred.
BTW, I hope you are aware that recovery will not kick off until a (configurable) grace period is over.
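
To find the very first crash trace, something like the following could be run over the OSD log. It is only a sketch: the function name is made up for illustration, and it assumes the log looks like the excerpts quoted below, i.e. a timestamped "In function" line followed by an untimestamped "FAILED assert" line.

```python
import re
import sys
from typing import Iterable, Optional

# Timestamp prefix as it appears in the quoted log excerpts,
# e.g. "2016-04-27 14:57:25.642669". This format is an assumption
# taken from those excerpts.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)")


def first_assert_time(lines: Iterable[str]) -> Optional[str]:
    """Return the last timestamp seen before the first 'FAILED assert' line.

    Returns None if the log contains no failed assert.
    """
    last_ts = None
    for line in lines:
        match = TS.match(line)
        if match:
            last_ts = match.group(1)
        if "FAILED assert" in line:
            return last_ts
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is passed on the command line; no default is assumed.
    with open(sys.argv[1], errors="replace") as log:
        print(first_assert_time(log))
```

Comparing that timestamp with the time the drive was pulled shows whether the OSD really took minutes to crash, or whether upstart restarts made it look that way.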

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Marcel Lauhoff
Sent: Tuesday, May 17, 2016 5:59 AM
To: ceph-users
Subject:  OSD process doesn't die immediately after device disappears


Hi,

We recently played the good ol' pull-a-hard-drive game and wondered why the OSD processes took a couple of minutes to recognize their misfortune.

In our configuration two OSDs share an HDD:
  OSD n uses it as its journal device,
  OSD n+1 uses it for its filesystem.

We expected the OSDs to detect this kind of failure and shut down immediately, so that transactions aren't blocked and recovery can start as soon as possible.

What do you think?


I read through the FileStore code about a year ago and can't remember any code that subscribes to events from the underlying devices.

Does anyone use external watchdog tools for this type of failure?
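
For illustration, the kind of external check meant here could be as small as the sketch below. The function name and the idea of polling device nodes are assumptions for this example, not an existing Ceph tool, and it only helps if the kernel/udev actually removes the device node when the drive disappears.

```python
import os
import stat
from typing import Iterable, List


def missing_block_devices(paths: Iterable[str]) -> List[str]:
    """Return the paths that no longer resolve to a block device node."""
    gone = []
    for path in paths:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # The node vanished (or cannot be reached): treat it as missing.
            gone.append(path)
            continue
        if not stat.S_ISBLK(mode):
            # Something exists at this path, but it is not a block device.
            gone.append(path)
    return gone
```

A watchdog would call this periodically with the journal and data devices of each OSD and stop the affected daemons when a path shows up in the result; what to do on detection is left open here.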



~irq0


The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
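
A note on the "wrote 18446744073709551611" value in the journal message: it equals 2^64 - 5, so it looks like a negative return code printed as an unsigned 64-bit number. Assuming that is what happened, it can be decoded as follows (the helper name is made up for this example):

```python
import errno


def decode_aio_result(raw: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return raw - (1 << 64) if raw >= (1 << 63) else raw


res = decode_aio_result(18446744073709551611)
print(res)  # -5
# Look up the symbolic name of the error number, if the result is negative.
if res < 0:
    print(errno.errorcode.get(-res, "unknown"))
```

On Linux, errno 5 is EIO, which is consistent with the aio write failing with an I/O error once the device was gone.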

--
Marcel Lauhoff
Mail: lauhoff@xxxxxxxxxxxx
XMPP: mlauhoff@xxxxxxxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


