On Wednesday, July 4, 2012 at 10:02 AM, Smart Weblications GmbH - Florian Wiessner wrote:
> On 04.07.2012 18:25, Gregory Farnum wrote:
> > On Wednesday, July 4, 2012 at 4:45 AM, Smart Weblications GmbH - Florian Wiessner wrote:
> > > Hi List,
> > >
> > > i today upgraded from 0.43 to 0.48 and now i have one monitor which does not
> > > want to start up anymore:
> > >
> > > ceph version 0.48argonaut-125-g4e774fb
> > > (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
> > > 1: /usr/bin/ceph-mon() [0x52f9c9]
> > > 2: (()+0xeff0) [0x7fb08dd11ff0]
> > > 3: (gsignal()+0x35) [0x7fb08c4f41b5]
> > > 4: (abort()+0x180) [0x7fb08c4f6fc0]
> > > 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb08cd88dc5]
> > > 6: (()+0xcb166) [0x7fb08cd87166]
> > > 7: (()+0xcb193) [0x7fb08cd87193]
> > > 8: (()+0xcb28e) [0x7fb08cd8728e]
> > > 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x940)
> > > [0x55b310]
> > > 10: /usr/bin/ceph-mon() [0x497317]
> > > 11: (Monitor::init()+0xc5a) [0x4857fa]
> > > 12: (main()+0x2789) [0x46ac79]
> > > 13: (__libc_start_main()+0xfd) [0x7fb08c4e0c8d]
> > > 14: /usr/bin/ceph-mon() [0x468309]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> > > interpret this.
> > >
> > > --- end dump of recent events ---
> > >
> > > How can i find out why it does not startup anymore? osd and mds is running fine..
> >
> > Is that all the output you get? There should be a line somewhere which says what the assert is, and what line number it's on. :)
>
> Is this what you are looking for:
> 2012-07-04 11:20:24.448430 7f423d943780 1 mon.3@-1(probing) e1 init fsid
> 4553d0f6-1b31-4ba5-9d97-edae55bcaab4
> 2012-07-04 11:20:24.448994 7f423d943780 -1 mon/Paxos.cc: In function 'bool
> Paxos::is_consistent()' thread 7f423d943780 time 2012-07-04 11:20:24.448637
> mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1))

Yep, that line.
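For anyone hunting for the same kind of line in a long monitor log, here is a minimal Python sketch that pulls out every "FAILED assert" entry along with its source file and line number. It assumes the log prints the assert on one line in the form shown above (`mon/Paxos.cc: 1031: FAILED assert(...)`). The names `AssertFailure` and `find_failed_asserts` are hypothetical helpers made up for this illustration and are not part of Ceph.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class AssertFailure(NamedTuple):
    """One failed assertion reported in a log (hypothetical helper type)."""
    source_file: str
    line_number: int
    expression: str


# Assumed line shape, taken from the log excerpt quoted in this thread:
#   mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1))
_FAILED_ASSERT = re.compile(
    r"(?P<file>\S+\.(?:cc|h)):\s*(?P<line>\d+):\s*FAILED assert\((?P<expr>.*)\)"
)


def find_failed_asserts(lines: Iterable[str]) -> Iterator[AssertFailure]:
    """Yield every failed assertion found in an iterable of log lines."""
    for text in lines:
        match = _FAILED_ASSERT.search(text)
        if match:
            yield AssertFailure(
                match["file"], int(match["line"]), match["expr"]
            )
```

Passing an open log file to `find_failed_asserts` (a file object iterates line by line) should list each assert. The same assert can show up more than once, because the crash dump of recent events repeats it.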
This means the monitor's on-disk state is inconsistent, but I can think of a number of scenarios which could have caused this, depending on how you upgraded your cluster (older monitors didn't mark on-disk whenever they deliberately went inconsistent on a catchup, which I bet is what happened here).

> ceph version 0.48argonaut-125-g4e774fb
> (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
> 1: /usr/bin/ceph-mon() [0x497317]
> 2: (Monitor::init()+0xc5a) [0x4857fa]
> 3: (main()+0x2789) [0x46ac79]
> 4: (__libc_start_main()+0xfd) [0x7f423bcfbc8d]
> 5: /usr/bin/ceph-mon() [0x468309]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> --- begin dump of recent events ---
> -3> 2012-07-04 11:20:24.447613 7f423d943780 1 store(/data/ceph/mon) mount
> -2> 2012-07-04 11:20:24.447722 7f423d943780 0 ceph version
> 0.48argonaut-125-g4e774fb (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8),
> process ceph-mon, pid 7436
> -1> 2012-07-04 11:20:24.448430 7f423d943780 1 mon.3@-1(probing) e1 init
> fsid 4553d0f6-1b31-4ba5-9d97-edae55bcaab4
> 0> 2012-07-04 11:20:24.448994 7f423d943780 -1 mon/Paxos.cc: In function
> 'bool Paxos::is_consistent()' thread 7f423d943780 time 2012-07-04 11:20:24.448637
> mon/Paxos.cc: 1031: FAILED assert(consistent || (slurping == 1))
>
> ceph version 0.48argonaut-125-g4e774fb
> (commit:4e774fbcb38fd6883232b72352512a5f8e4a66e8)
> 1: /usr/bin/ceph-mon() [0x497317]
> 2: (Monitor::init()+0xc5a) [0x4857fa]
> 3: (main()+0x2789) [0x46ac79]
> 4: (__libc_start_main()+0xfd) [0x7f423bcfbc8d]
> 5: /usr/bin/ceph-mon() [0x468309]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> --- end dump of recent events ---
> 2012-07-04 11:20:24.449567 7f423d943780 -1 *** Caught signal (Aborted) **
> in thread 7f423d943780
>
> > And while you're at it, is the rest of the cluster in fact working?
> > I don't think 0.43 to 0.48 is an upgrade path we tested.
>
> Anyway, i removed the mon and did a ceph-mon --mkfs with the 3 mons that were
> still working after the upgrade and got it up and running again.
>
> Yes, the cluster is still working after the upgrade. Also upgraded to linux
> 3.4.4 - it feels like the ceph-fuse and kernel ceph client is a little less
> robust than in 0.43...
>
> when i start copying from /ceph to other mp, then it seems that for the copy
> operation or in general for any operation, /ceph is unusable to other processes
> which then makes the client behave very sluggish... :(

Well, it shouldn't have gotten less stable since we haven't made any big changes there…but you aren't the only one reporting that things seem to be a little bit slower. We're going to have to look at that once people are back in the office after Independence Day.

> i can send you the contents of the monitor directory where it did not work after
> the upgrade if you want me to..

No, that won't be necessary. Thanks though!

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html