Hi

Below is the mds dump:

dumped mdsmap epoch 1799
epoch   1799
flags   0
created 2014-12-10 12:44:34.188118
modified        2015-05-04 07:16:37.205350
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
last_failure    1794
last_failure_osd_epoch  21750
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds 1
in      0
up      {0=5827504}
failed
stopped
data_pools      0
metadata_pool   1
inline_data     disabled
5827504:        10.20.0.11:6800/3382530 'ceph1' mds.0.262 up:rejoin seq 33159

The active+clean+replay state has been there for a day now, so something must not be OK if it should have cleared in a couple of minutes.

Thanks
Tuomas

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: 4. toukokuuta 2015 18:29
To: Tuomas Juntunen
Cc: ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

On Mon, 4 May 2015, Tuomas Juntunen wrote:
> Hi
>
> Thanks Sage, I got it working now. Everything else seems to be ok,
> except mds is reporting "mds cluster is degraded", not sure what could be wrong.
> Mds is running and all osds are up and pg's are active+clean and
> active+clean+replay.

Great! The 'replay' part should clear after a minute or two.

> Had to delete some empty pools which were created while the osd's were
> not working, and recovery started to go through.
>
> Seems mds is not that stable; this isn't the first time it has gone degraded.
> Before it just started to work again, but now I just can't get it back working.

What does 'ceph mds dump' say?

sage

> Thanks
>
> Br,
> Tuomas
>
> -----Original Message-----
> From: tuomas.juntunen@xxxxxxxxxxxxxxx [mailto:tuomas.juntunen@xxxxxxxxxxxxxxx]
> Sent: 1. toukokuuta 2015 21:14
> To: Sage Weil
> Cc: tuomas.juntunen; ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
>
> Thanks, I'll do this when the commit is available and report back.
>
> And indeed, I'll change to the official ones after everything is ok.
>
> Br,
> Tuomas
>
> On Fri, 1 May 2015, tuomas.juntunen@xxxxxxxxxxxxxxx wrote:
> >> Hi
> >>
> >> I deleted the images and img pools and started the osd's; they still die.
> >>
> >> Here's a log of one of the osd's after this, if you need it.
> >>
> >> http://beta.xaasbox.com/ceph/ceph-osd.19.log
> >
> > I've pushed another commit that should avoid this case, sha1
> > 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
> >
> > Note that once the pools are fully deleted (shouldn't take too long
> > once the osds are up and stabilize) you should switch back to the
> > normal packages that don't have these workarounds.
> >
> > sage
> >
> >> Br,
> >> Tuomas
> >>
> >> > Thanks man. I'll try it tomorrow. Have a good one.
> >> > Br,T
> >> >
> >> > -------- Original message --------
> >> > From: Sage Weil <sage@xxxxxxxxxxxx>
> >> > Date: 30/04/2015 18:23 (GMT+02:00)
> >> > To: Tuomas Juntunen <tuomas.juntunen@xxxxxxxxxxxxxxx>
> >> > Cc: ceph-users@xxxxxxxxxxxxxx, ceph-devel@xxxxxxxxxxxxxxx
> >> > Subject: RE: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> >> >
> >> > On Thu, 30 Apr 2015, tuomas.juntunen@xxxxxxxxxxxxxxx wrote:
> >> >> Hey
> >> >>
> >> >> Yes I can drop the images data, you think this will fix it?
> >> >
> >> > It's a slightly different assert that (I believe) should not
> >> > trigger once the pool is deleted. Please give that a try and if
> >> > you still hit it I'll whip up a workaround.
> >> >
> >> > Thanks!
> >> > sage
> >> >
> >> >> Br,
> >> >>
> >> >> Tuomas
> >> >>
> >> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
> >> >> >> Hi
> >> >> >>
> >> >> >> I updated that version and it seems that something did happen;
> >> >> >> the osd's stayed up for a while and 'ceph status' got updated.
> >> >> >> But then in a couple of minutes, they all went down the same way.
> >> >> >>
> >> >> >> I have attached a new 'ceph osd dump -f json-pretty' and got a
> >> >> >> new log from one of the osd's with osd debug = 20,
> >> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
> >> >> >
> >> >> > Sam mentioned that you had said earlier that this was not critical data?
> >> >> > If not, I think the simplest thing is to just drop those
> >> >> > pools. The important thing (from my perspective at least :)
> >> >> > is that we understand the root cause and can prevent this in the future.
> >> >> >
> >> >> > sage
> >> >> >
> >> >> >> Thank you!
> >> >> >>
> >> >> >> Br,
> >> >> >> Tuomas
> >> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> >> >> >> Sent: 28. huhtikuuta 2015 23:57
> >> >> >> To: Tuomas Juntunen
> >> >> >> Cc: ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> >> >> >> Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> >> >> >>
> >> >> >> Hi Tuomas,
> >> >> >>
> >> >> >> I've pushed an updated wip-hammer-snaps branch. Can you please try it?
> >> >> >> The build will appear here
> >> >> >>
> >> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
> >> >> >>
> >> >> >> (or a similar url; adjust for your distro).
> >> >> >>
> >> >> >> Thanks!
> >> >> >> sage
> >> >> >>
> >> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
> >> >> >>
> >> >> >> > [adding ceph-devel]
> >> >> >> >
> >> >> >> > Okay, I see the problem. This seems to be unrelated to the
> >> >> >> > giant -> hammer move... it's a result of the tiering changes you made:
> >> >> >> >
> >> >> >> > > > > > > > The following:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > ceph osd tier add img images --force-nonempty
> >> >> >> > > > > > > > ceph osd tier cache-mode images forward
> >> >> >> > > > > > > > ceph osd tier set-overlay img images
> >> >> >> >
> >> >> >> > Specifically, --force-nonempty bypassed important safety checks.
> >> >> >> >
> >> >> >> > 1. images had snapshots (and removed_snaps)
> >> >> >> >
> >> >> >> > 2. images was added as a tier *of* img, and img's
> >> >> >> > removed_snaps was copied to images, clobbering the
> >> >> >> > removed_snaps value (see OSDMap::Incremental::propagate_snaps_to_tiers)
> >> >> >> >
> >> >> >> > 3. tiering relation was undone, but removed_snaps was still gone
> >> >> >> >
> >> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is
> >> >> >> > initialized with the older map. later, in PGPool::update(),
> >> >> >> > we assume that removed_snaps always grows (never shrinks)
> >> >> >> > and we trigger an assert.
> >> >> >> >
> >> >> >> > To fix this I think we need to do 2 things:
> >> >> >> >
> >> >> >> > 1. make the OSD forgiving of removed_snaps getting smaller.
> >> >> >> > This is probably a good thing anyway: once we know snaps are
> >> >> >> > removed on all OSDs we can prune the interval_set in the OSDMap. Maybe.
> >> >> >> >
> >> >> >> > 2. Fix the mon to prevent this from happening, *even* when
> >> >> >> > --force-nonempty is specified. (This is the root cause.)
> >> >> >> >
> >> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
> >> >> >> >
> >> >> >> > sage
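For reference, here is a minimal, self-contained C++ sketch of the failure mode described in step 4 and visible in the backtrace further down the thread: the PG computes "newly removed snaps" by subtracting its cached removed_snaps from the set in the newer map, and because the newer map's set has shrunk, the subtraction hits an interval that is no longer there and the element count goes negative. The names below (SimpleIntervalSet, cached_removed, new_removed) and the example snap ranges are illustrative stand-ins, not Ceph's actual interval_set or PGPool::update() code; only the shape of the computation and the assert are meant to match the report.

    // Minimal sketch only -- SimpleIntervalSet is a stand-in for Ceph's
    // interval_set<snapid_t>; names and layout are invented for illustration,
    // but the failing step mirrors the reported
    // "./include/interval_set.h: 385: FAILED assert(_size >= 0)".
    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <map>

    using snapid = uint64_t;

    struct SimpleIntervalSet {
      std::map<snapid, snapid> m;  // start -> length of each removed-snap interval
      int64_t size = 0;            // running element count, like interval_set::_size

      void insert(snapid start, snapid len) {
        m[start] = len;
        size += len;
      }

      // erase() assumes the interval is present; if it is not, the running
      // count goes negative and the assert fires.
      void erase(snapid start, snapid len) {
        auto it = m.find(start);
        if (it != m.end() && it->second == len)
          m.erase(it);
        size -= len;
        assert(size >= 0);  // analogue of "FAILED assert(_size >= 0)"
      }

      // this = this - other, interval by interval.
      void subtract(const SimpleIntervalSet& other) {
        for (const auto& p : other.m)
          erase(p.first, p.second);
      }
    };

    int main() {
      // What the PG cached from the older OSDMap: snaps 1-10 and 20-25 removed.
      SimpleIntervalSet cached_removed;
      cached_removed.insert(1, 10);
      cached_removed.insert(20, 6);

      // What the newer map says after the tier add/remove clobbered the pool's
      // removed_snaps: only 1-10 remain, i.e. the set has shrunk.
      SimpleIntervalSet new_removed;
      new_removed.insert(1, 10);

      // PGPool::update()-style step: newly_removed = new_removed - cached_removed.
      // The interval starting at 20 is missing from the new set, so the
      // subtraction underflows and aborts -- which is why the OSDs die on startup.
      SimpleIntervalSet newly_removed = new_removed;
      newly_removed.subtract(cached_removed);

      std::cout << "not reached when the set has shrunk" << std::endl;
      return 0;
    }

In terms of the two fix ideas above, the first would amount to tolerating the missing interval in this step (skipping it, or intersecting before subtracting) instead of asserting, while the second keeps the monitor from ever producing a map in which a pool's removed_snaps shrinks in the first place.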
> >> >> >> > > > > > > > Idea was to make images as a tier to img, move
> >> >> >> > > > > > > > data to img, then change clients to use the new img pool.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Br,
> >> >> >> > > > > > > > Tuomas
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > > Can you explain exactly what you mean by:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > "Also I created one pool for tier to be able to move data without outage."
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > -Sam
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > ----- Original Message -----
> >> >> >> > > > > > > > > From: "tuomas juntunen" <tuomas.juntunen@xxxxxxxxxxxxxxx>
> >> >> >> > > > > > > > > To: "Ian Colle" <icolle@xxxxxxxxxx>
> >> >> >> > > > > > > > > Cc: ceph-users@xxxxxxxxxxxxxx
> >> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
> >> >> >> > > > > > > > > Subject: Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Hi
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Any solution for this yet?
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Br,
> >> >> >> > > > > > > > > Tuomas
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >> It looks like you may have hit
> >> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Ian R. Colle
> >> >> >> > > > > > > > >> Global Director of Software Engineering
> >> >> >> > > > > > > > >> Red Hat (Inktank is now part of Red Hat!)
> >> >> >> > > > > > > > >> http://www.linkedin.com/in/ircolle
> >> >> >> > > > > > > > >> http://www.twitter.com/ircolle
> >> >> >> > > > > > > > >> Cell: +1.303.601.7713
> >> >> >> > > > > > > > >> Email: icolle@xxxxxxxxxx
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> ----- Original Message -----
> >> >> >> > > > > > > > >> From: "tuomas juntunen" <tuomas.juntunen@xxxxxxxxxxxxxxx>
> >> >> >> > > > > > > > >> To: ceph-users@xxxxxxxxxxxxxx
> >> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
> >> >> >> > > > > > > > >> Subject: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Then created new pools and deleted some old ones. Also I created one pool for tier to be able to move data without outage.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> After these operations all but 10 OSD's are down and writing this kind of messages to the logs; I get more than 100 GB of these in a night:
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >>    -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started
> >> >> >> > > > > > > > >>    -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Start
> >> >> >> > > > > > > > >>    -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> >> >> >> > > > > > > > >>    -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
> >> >> >> > > > > > > > >>    -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started/Stray
> >> >> >> > > > > > > > >>    -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Reset 0.119467 4 0.000037
> >> >> >> > > > > > > > >>    -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started
> >> >> >> > > > > > > > >>    -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Start
> >> >> >> > > > > > > > >>    -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
> >> >> >> > > > > > > > >>    -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Start 0.000020 0 0.000000
> >> >> >> > > > > > > > >>     -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started/Stray
> >> >> >> > > > > > > > >>     -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Reset 7.511623 45 0.000165
> >> >> >> > > > > > > > >>     -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started
> >> >> >> > > > > > > > >>     -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Start
> >> >> >> > > > > > > > >>     -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
> >> >> >> > > > > > > > >>     -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Start 0.000023 0 0.000000
> >> >> >> > > > > > > > >>     -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
> >> >> >> > > > > > > > >>     -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
> >> >> >> > > > > > > > >>     -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
> >> >> >> > > > > > > > >>      0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7fd8e748d700 time 2015-04-27 10:17:08.809899
> >> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 0)
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> >> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
> >> >> >> > > > > > > > >>  2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xb0) [0x82cd50]
> >> >> >> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x52e) [0x80113e]
> >> >> >> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
> >> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) [0x6b0e43]
> >> >> >> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x21c) [0x6b191c]
> >> >> >> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x709278]
> >> >> >> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
> >> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> >> >> >> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
> >> >> >> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the following messages, also lots of them:
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.? 10.20.0.13:0/1174409' entity='osd.30' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]: dispatch
> >> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.? 10.20.0.13:0/1174483' entity='osd.26' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]: dispatch
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's; the nodes are also mons and mds's to save servers. All run Ubuntu 14.04.2.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> I have pretty much tried everything I could think of.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Restarting daemons doesn't help.
> >> >> >> > > > > > > > >>
> >> >> >> > > > > > > > >> Any help would be appreciated. I can also provide more logs if necessary; they just seem to get pretty large in a few moments.
> >> >> >> > > > > > > > >> > >> >> >> > > > > > > > >> Thank you > >> >> >> > > > > > > > >> Tuomas > >> >> >> > > > > > > > >> > >> >> >> > > > > > > > >> > >> >> >> > > > > > > > >> ____________________________________________ > >> >> >> > > > > > > > >> __ _ ceph-users mailing list > >> >> >> > > > > > > > >> ceph-users@xxxxxxxxxxxxxx > >> >> >> > > > > > > > >> http://lists.ceph.com/listinfo.cgi/ceph-user > >> >> >> > > > > > > > >> s- > >> >> >> > > > > > > > >> ceph.com > >> >> >> > > > > > > > >> > >> >> >> > > > > > > > >> > >> >> >> > > > > > > > >> > >> >> >> > > > > > > > > > >> >> >> > > > > > > > > > >> >> >> > > > > > > > > _____________________________________________ > >> >> >> > > > > > > > > __ ceph-users mailing list > >> >> >> > > > > > > > > ceph-users@xxxxxxxxxxxxxx > >> >> >> > > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users > >> >> >> > > > > > > > > -c > >> >> >> > > > > > > > > eph.com > >> >> >> > > > > > > > > > >> >> >> > > > > > > > > > >> >> >> > > > > > > > > > >> >> >> > > > > > > > > >> >> >> > > > > > > > > >> >> >> > > > > > > > _______________________________________________ > >> >> >> > > > > > > > ceph-users mailing list > >> >> >> > > > > > > > ceph-users@xxxxxxxxxxxxxx > >> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-c > >> >> >> > > > > > > > ep > >> >> >> > > > > > > > h.com > >> >> >> > > > > > > > > >> >> >> > > > > > > > > >> >> >> > > > > > > > > >> >> >> > > > > > > > _______________________________________________ > >> >> >> > > > > > > > ceph-users mailing list > >> >> >> > > > > > > > ceph-users@xxxxxxxxxxxxxx > >> >> >> > > > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-c > >> >> >> > > > > > > > ep > >> >> >> > > > > > > > h.com > >> >> >> > > > > > > > > >> >> >> > > > > > > > > >> >> >> > > > > > > > >> >> >> > > > > > > >> >> >> > > > > > > >> >> >> > > > > > >> >> >> > > > > > >> >> >> > > > > >> >> >> > > > >> >> >> > > > >> >> >> > _______________________________________________ > >> >> >> > ceph-users mailing list > >> >> >> > ceph-users@xxxxxxxxxxxxxx > >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> >> >> > > >> >> >> > > >> >> >> > >> >> > > >> >> > >> >> > >> >> -- > >> >> To unsubscribe from this list: send the line "unsubscribe > >> >> ceph-devel" in the body of a message to > >> >> majordomo@xxxxxxxxxxxxxxx More majordomo info at > >> >> http://vger.kernel.org/majordomo-info.html > >> >> > >> >> > >> > _______________________________________________ > >> > ceph-users mailing list > >> > ceph-users@xxxxxxxxxxxxxx > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > > >> > >> > >> > > > > _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com