For some weird reason I couldn't figure out, after I brought down all components of my ceph cluster simultaneously and then brought them back up, the mds wouldn't come back: it complained about a zero-sized entry in its journal, some 8+MB before the end of the journal. I had never hit this problem before, even though it's not entirely unusual for me to restart all cluster components at once after some configuration change.

Anyway... Long story short, after some poking at the mds journal to see if I could figure out how to get it back up, I gave up and decided to use the --reset-journal hammer. Except that it just sat there, never completing or even getting noticed by the cluster.

After a bit of additional investigation, the following patch was born, and now my Emperor cluster is back up. Phew! :-)
mds: fix Resetter locking

From: Alexandre Oliva <oliva@xxxxxxx>

ceph-mds --reset-journal didn't work; it would deadlock waiting for
the osdmap.  Comparing the init code in the Dumper (that worked) with
that in the Resetter (that didn't), I noticed the lock had to be
released before waiting for the osdmap.

Now the resetter works.  However, both the resetter and the dumper
fail an assertion after they've performed their task; I didn't look
into it:

../../src/msg/SimpleMessenger.cc: In function 'void SimpleMessenger::reaper()' thread 7fdc188d27c0 time 2013-12-19 04:48:16.930895
../../src/msg/SimpleMessenger.cc: 230: FAILED assert(!cleared)
 ceph version 0.72.1-6-g6bca44e (6bca44ec129d11f1c4f38357db8ae435616f2c7c)
 1: (SimpleMessenger::reaper()+0x706) [0x880da6]
 2: (SimpleMessenger::wait()+0x36f) [0x88180f]
 3: (Resetter::reset()+0x714) [0x56e664]
 4: (main()+0x1359) [0x562769]
 5: (__libc_start_main()+0xf5) [0x3632e21b45]
 6: /l/tmp/build/ceph/build/src/ceph-mds() [0x564e49]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-12-19 04:48:16.934093 7fdc188d27c0 -1 ../../src/msg/SimpleMessenger.cc: In function 'void SimpleMessenger::reaper()' thread 7fdc188d27c0 time 2013-12-19 04:48:16.930895
../../src/msg/SimpleMessenger.cc: 230: FAILED assert(!cleared)

Signed-off-by: Alexandre Oliva <oliva@xxxxxxx>
---
 src/mds/Resetter.cc |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/mds/Resetter.cc b/src/mds/Resetter.cc
index e968cdc..ed409a4 100644
--- a/src/mds/Resetter.cc
+++ b/src/mds/Resetter.cc
@@ -79,9 +79,9 @@ void Resetter::init(int rank)
   objecter->init_unlocked();
   lock.Lock();
   objecter->init_locked();
+  lock.Unlock();
   objecter->wait_for_osd_map();
   timer.init();
-  lock.Unlock();
 }
 
 void Resetter::shutdown()
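
As a rough illustration of the locking pattern behind this fix, here is a minimal standalone sketch. It is not Ceph code: `MapWaiter`, `deliver_map`, `wait_for_map` and `init_sketch` are hypothetical names, and it assumes (this is a guess, not something verified in the patch) that the thread delivering the osdmap needs the same lock the initializing thread is holding. A timeout stands in for the hang so that the example terminates.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the client plus its map-delivery path.
struct MapWaiter {
  std::mutex client_lock;             // plays the role of the Resetter's `lock`
  std::mutex map_lock;                // private to the wait itself
  std::condition_variable map_cond;
  bool have_map = false;

  // Delivery path (assumption): it must take client_lock before it can
  // publish the map and wake up the waiter.
  void deliver_map() {
    std::lock_guard<std::mutex> l(client_lock);
    {
      std::lock_guard<std::mutex> m(map_lock);
      have_map = true;
    }
    map_cond.notify_all();
  }

  // Returns true if the map arrived before the timeout expired.
  bool wait_for_map(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> m(map_lock);
    return map_cond.wait_for(m, timeout, [this] { return have_map; });
  }
};

// Mimics the two orderings in the diff. Returns true if the wait was
// satisfied, false if it timed out (where the real code would hang).
bool init_sketch(bool unlock_before_wait) {
  MapWaiter w;
  const std::chrono::milliseconds timeout(500);

  w.client_lock.lock();
  std::thread dispatcher([&w] { w.deliver_map(); });

  bool ok;
  if (unlock_before_wait) {
    // Patched order: release the lock first, so delivery can proceed.
    w.client_lock.unlock();
    ok = w.wait_for_map(timeout);
  } else {
    // Original order: the dispatcher is stuck on client_lock, so the
    // map never arrives while we wait.
    ok = w.wait_for_map(timeout);
    w.client_lock.unlock();
  }

  dispatcher.join();
  return ok;
}
```

Under these assumptions, `init_sketch(true)` returns true, while `init_sketch(false)` returns false after the timeout, which corresponds to --reset-journal just sitting there.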
-- 
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist      Red Hat Brazil Compiler Engineer