Hi

Updated the logfile, same place:
http://beta.xaasbox.com/ceph/ceph-osd.15.log

Br,
Tuomas

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx]
Sent: 27 April 2015 22:22
To: Tuomas Juntunen
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: RE: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

On Mon, 27 Apr 2015, Tuomas Juntunen wrote:
> Hey
>
> Got the log, you can get it from
> http://beta.xaasbox.com/ceph/ceph-osd.15.log

Can you repeat this with 'debug osd = 20'? Thanks!

sage

>
> Br,
> Tuomas
>
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: 27 April 2015 20:45
> To: Tuomas Juntunen
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
>
> Yeah, no snaps:
>
> images:
>     "snap_mode": "selfmanaged",
>     "snap_seq": 0,
>     "snap_epoch": 17882,
>     "pool_snaps": [],
>     "removed_snaps": "[]",
>
> img:
>     "snap_mode": "selfmanaged",
>     "snap_seq": 0,
>     "snap_epoch": 0,
>     "pool_snaps": [],
>     "removed_snaps": "[]",
>
> ...and actually the log shows this happens on pool 2 (rbd), which has
>
>     "snap_mode": "selfmanaged",
>     "snap_seq": 0,
>     "snap_epoch": 0,
>     "pool_snaps": [],
>     "removed_snaps": "[]",
>
> I'm guessing the offending code is
>
>     pi->build_removed_snaps(newly_removed_snaps);
>     newly_removed_snaps.subtract(cached_removed_snaps);
>
> so newly_removed_snaps should be empty, and apparently
> cached_removed_snaps is not? Maybe one of your older osdmaps has snap
> info for rbd? It doesn't make sense. :/ Maybe
>
>     ceph osd dump 18127 -f json-pretty
>
> just to be certain? I've pushed a branch 'wip-hammer-snaps' that will
> appear at gitbuilder.ceph.com in 20-30 minutes that will output some
> additional debug info. It will be at
>
>     http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-hammer-snaps
>
> or similar, depending on your distro. Can you install it on one node
> and start an osd with logging to reproduce the crash?
>
> Thanks!
> sage
>
>
> On Mon, 27 Apr 2015, Tuomas Juntunen wrote:
>
> > Hi
> >
> > Here you go
> >
> > Br,
> > Tuomas
> >
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: 27 April 2015 19:23
> > To: Tuomas Juntunen
> > Cc: 'Samuel Just'; ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> >
> > On Mon, 27 Apr 2015, Tuomas Juntunen wrote:
> > > Thanks for the info.
> > >
> > > To my knowledge there were no snapshots on that pool, but I cannot
> > > verify that.
> >
> > Can you attach a 'ceph osd dump -f json-pretty'? That will shed a
> > bit more light on what happened (and the simplest way to fix it).
> >
> > sage
> >
> >
> > > Any way to make this work again? Removing the tier and other
> > > settings didn't fix it, I tried it the second this happened.
> > >
> > > Br,
> > > Tuomas
> > >
> > > -----Original Message-----
> > > From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> > > Sent: 27 April 2015 15:50
> > > To: tuomas juntunen
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> > >
> > > So, the base tier is what determines the snapshots for the
> > > cache/base pool amalgam. You added a populated pool complete with
> > > snapshots on top of a base tier without snapshots. Apparently, it
> > > caused an existential crisis for the snapshot code. That's one of
> > > the reasons why there is a --force-nonempty flag for that
> > > operation, I think. I think the immediate answer is probably to
> > > disallow pools with snapshots as a cache tier altogether until we
> > > think of a good way to make it work.
> > > -Sam
> > >
> > > ----- Original Message -----
> > > From: "tuomas juntunen" <tuomas.juntunen@xxxxxxxxxxxxxxx>
> > > To: "Samuel Just" <sjust@xxxxxxxxxx>
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Sent: Monday, April 27, 2015 4:56:58 AM
> > > Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> > >
> > > The following:
> > >
> > > ceph osd tier add img images --force-nonempty
> > > ceph osd tier cache-mode images forward
> > > ceph osd tier set-overlay img images
> > >
> > > The idea was to make images a tier of img, move the data to img,
> > > then change clients to use the new img pool.
> > >
> > > Br,
> > > Tuomas
> > >
> > > > Can you explain exactly what you mean by:
> > > >
> > > > "Also I created one pool for tier to be able to move data
> > > > without outage."
> > > >
> > > > -Sam
> > > > ----- Original Message -----
> > > > From: "tuomas juntunen" <tuomas.juntunen@xxxxxxxxxxxxxxx>
> > > > To: "Ian Colle" <icolle@xxxxxxxxxx>
> > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > Sent: Monday, April 27, 2015 4:23:44 AM
> > > > Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> > > >
> > > > Hi
> > > >
> > > > Any solution for this yet?
> > > >
> > > > Br,
> > > > Tuomas
> > > >
> > > >> It looks like you may have hit
> > > >> http://tracker.ceph.com/issues/7915
> > > >>
> > > >> Ian R. Colle
> > > >> Global Director
> > > >> of Software Engineering
> > > >> Red Hat (Inktank is now part of Red Hat!)
> > > >> http://www.linkedin.com/in/ircolle
> > > >> http://www.twitter.com/ircolle
> > > >> Cell: +1.303.601.7713
> > > >> Email: icolle@xxxxxxxxxx
> > > >>
> > > >> ----- Original Message -----
> > > >> From: "tuomas juntunen" <tuomas.juntunen@xxxxxxxxxxxxxxx>
> > > >> To: ceph-users@xxxxxxxxxxxxxx
> > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> > > >> Subject: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> > > >>
> > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer
> > > >>
> > > >> Then created new pools and deleted some old ones. Also I
> > > >> created one pool for tier to be able to move data without outage.
> > > >>
> > > >> After these operations all but 10 OSD's are down and creating
> > > >> this kind of messages to logs, I get more than 100gb of these
> > > >> in a night:
> > > >>
> > > >> -19> 2015-04-27 10:17:08.808584 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started
> > > >> -18> 2015-04-27 10:17:08.808596 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Start
> > > >> -17> 2015-04-27 10:17:08.808608 7fd8e748d700 1 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> > > >> -16> 2015-04-27 10:17:08.808621 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> > > >> -15> 2015-04-27 10:17:08.808637 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started/Stray
> > > >> -14> 2015-04-27 10:17:08.808796 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Reset 0.119467 4 0.000037
> > > >> -13> 2015-04-27 10:17:08.808817 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started
> > > >> -12> 2015-04-27 10:17:08.808828 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Start
> > > >> -11> 2015-04-27 10:17:08.808838 7fd8e748d700 1 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> > > >> -10> 2015-04-27 10:17:08.808849 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Start 0.000020 0 0.000000
> > > >> -9> 2015-04-27 10:17:08.808861 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started/Stray
> > > >> -8> 2015-04-27 10:17:08.809427 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Reset 7.511623 45 0.000165
> > > >> -7> 2015-04-27 10:17:08.809445 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started
> > > >> -6> 2015-04-27 10:17:08.809456 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Start
> > > >> -5> 2015-04-27 10:17:08.809468 7fd8e748d700 1 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
> > > >> -4> 2015-04-27 10:17:08.809479 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Start 0.000023 0 0.000000
> > > >> -3> 2015-04-27 10:17:08.809492 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
> > > >> -2> 2015-04-27 10:17:08.809502 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
> > > >> -1> 2015-04-27 10:17:08.809513 7fd8e748d700 5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
> > > >> 0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7fd8e748d700 time 2015-04-27 10:17:08.809899
> > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 0)
> > > >>
> > > >> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> > > >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
> > > >> 2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xb0) [0x82cd50]
> > > >> 3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x52e) [0x80113e]
> > > >> 4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> > > >> 5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) [0x6b0e43]
> > > >> 6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> > > >> 7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x709278]
> > > >> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
> > > >> 9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> > > >> 10: (()+0x8182) [0x7fd906946182]
> > > >> 11: (clone()+0x6d) [0x7fd904eb147d]
> > > >>
> > > >> Also by monitoring (ceph -w) I get the following messages, also
> > > >> lots of them.
> > > >>
> > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.? 10.20.0.13:0/1174409' entity='osd.30' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]: dispatch
> > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.? 10.20.0.13:0/1174483' entity='osd.26' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]: dispatch
> > > >>
> > > >> This is a cluster of 3 nodes with 36 OSD's, nodes are also mons
> > > >> and mds's to save servers. All run Ubuntu 14.04.2.
> > > >>
> > > >> I have pretty much tried everything I could think of.
> > > >>
> > > >> Restarting daemons doesn't help.
> > > >>
> > > >> Any help would be appreciated. I can also provide more logs if
> > > >> necessary. They just seem to get pretty large in few moments.
> > > >>
> > > >> Thank you
> > > >> Tuomas
> > > >>
> > > >> _______________________________________________
> > > >> ceph-users mailing list
> > > >> ceph-users@xxxxxxxxxxxxxx
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
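
A note on the failure mode discussed above: Sage's guess is that `newly_removed_snaps.subtract(cached_removed_snaps)` fails because the cached set holds snaps that the freshly built (empty) set does not. The sketch below only illustrates that idea and is not Ceph's `interval_set`. `ToyIntervalSet` and `subtract_throws` are hypothetical names made up for this example, and the toy throws an exception where the real code hits `assert(_size >= 0)`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>

// Toy interval set over snapshot ids. Hypothetical stand-in, NOT Ceph's interval_set.
class ToyIntervalSet {
public:
    // Add [start, start + len). Assumes the range neither overlaps nor touches
    // an existing one (this toy does no merging).
    void insert(std::uint64_t start, std::uint64_t len) {
        if (len == 0) return;
        ranges_[start] = len;
        size_ += len;
    }

    // Remove [start, start + len). The range must lie inside one stored interval;
    // otherwise throw (assumption: stands in for the assert in the real code).
    void erase(std::uint64_t start, std::uint64_t len) {
        if (len == 0) return;
        auto it = ranges_.upper_bound(start);    // first interval starting after start
        if (it == ranges_.begin())
            throw std::logic_error("erase: range not in set");
        --it;                                    // interval starting at or before start
        const std::uint64_t s = it->first;
        const std::uint64_t e = s + it->second;  // one past the interval's end
        if (start + len > e)
            throw std::logic_error("erase: range not in set");
        ranges_.erase(it);
        if (start > s) ranges_[s] = start - s;                          // left remainder
        if (e > start + len) ranges_[start + len] = e - (start + len);  // right remainder
        size_ -= len;
    }

    // Remove every interval of other; only valid when other is a subset of *this.
    void subtract(const ToyIntervalSet& other) {
        for (const auto& r : other.ranges_) erase(r.first, r.second);
    }

    std::uint64_t size() const { return size_; }
    bool empty() const { return ranges_.empty(); }

private:
    std::map<std::uint64_t, std::uint64_t> ranges_;  // start -> length
    std::uint64_t size_ = 0;
};

// True if a.subtract(b) would be rejected. Works on a copy of a.
bool subtract_throws(ToyIntervalSet a, const ToyIntervalSet& b) {
    try {
        a.subtract(b);
    } catch (const std::logic_error&) {
        return true;
    }
    return false;
}
```

With an empty `newly_removed` set and a `cached_removed` set holding a single snap id, `subtract_throws(newly_removed, cached_removed)` returns true. That is the shape of the situation Sage describes, where the pool reports `"removed_snaps": "[]"` but the cached set is apparently not empty.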