I apologise for the delay in getting back to you. I just pushed a
branch called wip-snap-workaround based on v0.45. It should at least
avoid the crash you saw. Let me know if you hit further trouble.
-Sam

On Thu, Apr 26, 2012 at 8:12 AM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
> Hi,
>
> Anyone have any idea how to fix this ? Can i just correct conflict
> data in osdmaps ?
>
>
> On Wed, Apr 25, 2012 at 3:42 PM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
>> After removing pool snapshot I was trying to make self managed
>> snapshot and after reading source this was the root cause of this
>> problem.
>>
>>
>> On Wed, Apr 25, 2012 at 1:24 PM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
>>> after upgrade to v0.45 stack trace is as follows:
>>>
>>> Program received signal SIGABRT, Aborted.
>>> [Switching to Thread 0x7fffeac55700 (LWP 11011)]
>>> 0x00007ffff5ebb445 in raise () from /lib/x86_64-linux-gnu/libc.so.6
>>> (gdb) bt
>>> #0 0x00007ffff5ebb445 in raise () from /lib/x86_64-linux-gnu/libc.so.6
>>> #1 0x00007ffff5ebebab in abort () from /lib/x86_64-linux-gnu/libc.so.6
>>> #2 0x00007ffff680969d in __gnu_cxx::__verbose_terminate_handler() ()
>>> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #3 0x00007ffff6807846 in ?? ()
>>> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #4 0x00007ffff6807873 in std::terminate() ()
>>> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #5 0x00007ffff680796e in __cxa_throw ()
>>> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #6 0x0000000000695ec0 in ceph::__ceph_assert_fail (
>>> assertion=0x80b1ed "_size >= 0", file=0x80a222 "./include/interval_set.h",
>>> line=382,
>>> func=0x81bf60 "void interval_set<T>::erase(T, T) [with T = snapid_t]")
>>> at common/assert.cc:75
>>> #7 0x00000000005d1359 in erase (len=..., start=..., this=0xbe5738)
>>> at ./include/interval_set.h:382
>>> #8 subtract (a=..., this=0xbe5738) at ./include/interval_set.h:404
>>> #9 OSD::advance_map (this=0xbca000, t=..., tfin=<optimized out>)
>>> at osd/OSD.cc:3475
>>> #10 0x00000000005d33bf in OSD::handle_osd_map (this=0xbca000, m=0x2a4c800)
>>> at osd/OSD.cc:3272
>>> #11 0x00000000005d4c9b in OSD::_dispatch (this=0xbca000, m=0x2a4c800)
>>> at osd/OSD.cc:2780
>>> ---Type <return> to continue, or q <return> to quit---
>>> #12 0x00000000005d52a5 in OSD::ms_dispatch (this=0xbca000, m=0x2a4c800)
>>> at osd/OSD.cc:2605
>>> #13 0x000000000067a91b in ms_deliver_dispatch (m=0x2a4c800, this=0xba8680)
>>> at msg/Messenger.h:178
>>> #14 SimpleMessenger::dispatch_entry (this=0xba8680)
>>> at msg/SimpleMessenger.cc:363
>>> #15 0x0000000000648f1d in SimpleMessenger::DispatchThread::entry (
>>> this=<optimized out>) at msg/SimpleMessenger.h:560
>>> #16 0x00007ffff79c2e9a in start_thread ()
>>> from /lib/x86_64-linux-gnu/libpthread.so.0
>>> #17 0x00007ffff5f774bd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>> #18 0x0000000000000000 in ?? ()
>>>
>>>
>>> On Wed, Apr 25, 2012 at 12:11 PM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
>>>> osd dump is like this:
>>>>
>>>> pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
>>>> 768 pgp_num 768 lpg_num 2 lpgp_num 2 last_change 1 owner 0
>>>> crash_replay_interval 45
>>>> pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
>>>> pg_num 768 pgp_num 768 lpg_num 2 lpgp_num 2 last_change 1 owner 0
>>>> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num
>>>> 768 pgp_num 768 lpg_num 2 lpgp_num 2 last_change 1 owner 0
>>>> pool 9 'nova' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
>>>> 2568 pgp_num 2568 lpg_num 0 lpgp_num 0 last_change 1435 owner
>>>> 18446744073709551615
>>>> removed_snaps [1~1]
>>>> pool 10 'glance' rep size 2 crush_ruleset 0 object_hash rjenkins
>>>> pg_num 2568 pgp_num 2568 lpg_num 0 lpgp_num 0 last_change 132 owner
>>>> 18446744073709551615
>>>>
>>>>
>>>> On Wed, Apr 25, 2012 at 11:04 AM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
>>>>> Hi,
>>>>>
>>>>> After making and removing snapshot from one of the pools, all of the
>>>>> osd in cluster are dying with log like below:
>>>>>
>>>>>
>>>>> 2012-04-25 11:01:00.938313 7f66694b9700 osd.1 1434 removing old
>>>>> osdmap epoch 966
>>>>> 2012-04-25 11:01:00.938330 7f66694b9700 osd.1 1434 removing old
>>>>> osdmap epoch 967
>>>>> 2012-04-25 11:01:00.938348 7f66694b9700 osd.1 1434 advance to epoch
>>>>> 1435 (<= newest 1470)
>>>>> 2012-04-25 11:01:00.939437 7f66694b9700 osd.1 1435 advance_map epoch
>>>>> 1435 1325 pgs
>>>>> 2012-04-25 11:01:00.939455 7f66694b9700 osd.1 1435 pool 0 removed
>>>>> snaps [], unchanged (snap_epoch = 0)
>>>>> 2012-04-25 11:01:00.939469 7f66694b9700 osd.1 1435 pool 1 removed
>>>>> snaps [], unchanged (snap_epoch = 0)
>>>>> 2012-04-25 11:01:00.939482 7f66694b9700 osd.1 1435 pool 2 removed
>>>>> snaps [], unchanged (snap_epoch = 0)
>>>>> ./include/interval_set.h: In function 'void interval_set<T>::erase(T,
>>>>> T) [with T = snapid_t]' thread 7f66694b9700 time 2012-04-25
>>>>> 11:01:00.939509
>>>>> ./include/interval_set.h: 382: FAILED assert(_size >= 0)
>>>>> ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>>> 1: (OSD::advance_map(ObjectStore::Transaction&, C_Contexts*)+0x2971) [0x5cfb51]
>>>>> 2: (OSD::handle_osd_map(MOSDMap*)+0x193c) [0x5d162c]
>>>>> 3: (OSD::_dispatch(Message*)+0x2eb) [0x5d34fb]
>>>>> 4: (OSD::ms_dispatch(Message*)+0x129) [0x5d3a59]
>>>>> 5: (SimpleMessenger::dispatch_entry()+0x78b) [0x67513b]
>>>>> 6: (SimpleMessenger::DispatchThread::entry()+0xd) [0x52124d]
>>>>> 7: (()+0x7e9a) [0x7f6676226e9a]
>>>>> 8: (clone()+0x6d) [0x7f66747db4bd]
>>>>> ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>>> 1: (OSD::advance_map(ObjectStore::Transaction&, C_Contexts*)+0x2971) [0x5cfb51]
>>>>> 2: (OSD::handle_osd_map(MOSDMap*)+0x193c) [0x5d162c]
>>>>> 3: (OSD::_dispatch(Message*)+0x2eb) [0x5d34fb]
>>>>> 4: (OSD::ms_dispatch(Message*)+0x129) [0x5d3a59]
>>>>> 5: (SimpleMessenger::dispatch_entry()+0x78b) [0x67513b]
>>>>> 6: (SimpleMessenger::DispatchThread::entry()+0xd) [0x52124d]
>>>>> 7: (()+0x7e9a) [0x7f6676226e9a]
>>>>> 8: (clone()+0x6d) [0x7f66747db4bd]
>>>>> *** Caught signal (Aborted) **
>>>>> in thread 7f66694b9700
>>>>> ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>>> 1: /usr/bin/ceph-osd() [0x6fa0c6]
>>>>> 2: (()+0xfcb0) [0x7f667622ecb0]
>>>>> 3: (gsignal()+0x35) [0x7f667471f445]
>>>>> 4: (abort()+0x17b) [0x7f6674722bab]
>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f667506d69d]
>>>>> 6: (()+0xb5846) [0x7f667506b846]
>>>>> 7: (()+0xb5873) [0x7f667506b873]
>>>>> 8: (()+0xb596e) [0x7f667506b96e]
>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>>> const*)+0x200) [0x68f420]
>>>>> 10: (OSD::advance_map(ObjectStore::Transaction&, C_Contexts*)+0x2971)
>>>>> [0x5cfb51]
>>>>> 11: (OSD::handle_osd_map(MOSDMap*)+0x193c) [0x5d162c]
>>>>> 12: (OSD::_dispatch(Message*)+0x2eb) [0x5d34fb]
>>>>> 13: (OSD::ms_dispatch(Message*)+0x129) [0x5d3a59]
>>>>> 14: (SimpleMessenger::dispatch_entry()+0x78b) [0x67513b]
>>>>> 15: (SimpleMessenger::DispatchThread::entry()+0xd) [0x52124d]
>>>>> 16: (()+0x7e9a) [0x7f6676226e9a]
>>>>> 17: (clone()+0x6d) [0x7f66747db4bd]
>>>>>
>>>>>
>>>>> --
>>>>> Tomasz Paszkowski
>>>>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>>>>> +48500166299

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
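
For anyone reading the trace later: the failed assert(_size >= 0) is in
interval_set<T>::erase, reached through subtract from OSD::advance_map.
The sketch below is a toy Python model, not Ceph's interval_set.h. The
class, its fields, and the method bodies are hypothetical and only borrow
the names seen in the backtrace. It assumes the set keeps a running size
that must never go negative, and shows how subtracting a range the set
does not hold would trip such a check. That is one plausible reading of
the crash, not a confirmed diagnosis.

```python
class IntervalSet:
    """Toy interval set over integer ids (hypothetical, not Ceph code)."""

    def __init__(self, intervals=()):
        self.m = {}    # start -> length; intervals are assumed disjoint
        self.size = 0  # total number of ids covered by the set
        for start, length in intervals:
            self.insert(start, length)

    def insert(self, start, length):
        # Simplification: the caller must not insert overlapping ranges.
        self.m[start] = length
        self.size += length

    def erase(self, start, length):
        # Stand-in for the "_size >= 0" check named in the trace.
        if self.size - length < 0:
            raise AssertionError("_size >= 0")
        # Find the stored interval that fully contains the erased range.
        for s, l in self.m.items():
            if s <= start and start + length <= s + l:
                break
        else:
            raise AssertionError("range is not contained in the set")
        del self.m[s]
        self.size -= length
        if start > s:
            # Keep the part of the interval before the erased range.
            self.m[s] = start - s
        if start + length < s + l:
            # Keep the part of the interval after the erased range.
            self.m[start + length] = (s + l) - (start + length)

    def subtract(self, other):
        # Erase every interval of `other`; `other` must be a subset.
        for s, l in list(other.m.items()):
            self.erase(s, l)


# Subtracting a subset works: {1} minus {1} leaves the empty set.
current = IntervalSet([(1, 1)])
current.subtract(IntervalSet([(1, 1)]))

# Subtracting [1~1] from an empty set fails the size check.
try:
    IntervalSet().subtract(IntervalSet([(1, 1)]))
    failed = False
except AssertionError:
    failed = True
```

In this model subtract is only safe when the argument is a subset of the
set it is applied to, so a caller that cannot guarantee that has to check
containment first.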