Re: Upgrade from Giant to Hammer and after some basic operations most of the OSD's went down

Hi

Thanks Sage, I got it working now. Everything else seems to be OK, except
that the MDS is reporting "mds cluster is degraded", and I'm not sure what
could be wrong. The MDS is running, all OSDs are up, and the PGs are
active+clean and active+clean+replay.

I had to delete some empty pools which were created while the OSDs were not
working, and after that recovery started to go through.
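
In case it is useful to anyone searching the archives, deleting an empty
pool on Hammer looks roughly like the sketch below; the pool name is a
placeholder, not one of the actual pools from this cluster.

    # sanity-check that the pool really is empty before dropping it
    rados df
    ceph df

    # the pool name must be given twice, plus the confirmation flag
    ceph osd pool delete <empty-pool> <empty-pool> --yes-i-really-really-mean-it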

The MDS doesn't seem that stable; this isn't the first time it has gone
degraded. Before, it eventually started working again on its own, but this
time I just can't get it back.
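
For anyone else who hits "mds cluster is degraded" on Hammer, a few generic
commands help narrow down what state the MDS is stuck in; nothing here is
specific to this cluster, and <id> stands for the MDS daemon name:

    ceph -s                # overall status, including the mdsmap line
    ceph health detail     # spells out which component is degraded and why
    ceph mds stat          # compact mdsmap: which rank is in which state
    ceph mds dump          # full mdsmap, including up/in ranks and standbys

    # the MDS log usually shows which state (e.g. replay) it is stuck in
    tail -f /var/log/ceph/ceph-mds.<id>.log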

Thanks

Br,
Tuomas


-----Original Message-----
From: tuomas.juntunen@xxxxxxxxxxxxxxx
[mailto:tuomas.juntunen@xxxxxxxxxxxxxxx] 
Sent: 1 May 2015 21:14
To: Sage Weil
Cc: tuomas.juntunen; ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re:  Upgrade from Giant to Hammer and after some basic
operations most of the OSD's went down

Thanks, I'll do this when the commit is available and report back.

And indeed, I'll change to the official packages once everything is OK.

Br,
Tuomas

> On Fri, 1 May 2015, tuomas.juntunen@xxxxxxxxxxxxxxx wrote:
>> Hi
>>
>> I deleted the images and img pools and started the OSDs, but they still die.
>>
>> Here's a log of one of the osd's after this, if you need it.
>>
>> http://beta.xaasbox.com/ceph/ceph-osd.19.log
>
> I've pushed another commit that should avoid this case, sha1 
> 425bd4e1dba00cc2243b0c27232d1f9740b04e34.
>
> Note that once the pools are fully deleted (shouldn't take too long 
> once the osds are up and stabilize) you should switch back to the 
> normal packages that don't have these workarounds.
>
> sage
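
As a rough way to tell when the pools are fully deleted and the cluster has
settled (before switching back to the stock packages), watching the cluster
with the standard commands is usually enough; this is a generic sketch, not
advice taken from the thread itself:

    ceph osd lspools   # the deleted pools should no longer be listed
    ceph -s            # wait for all PGs to go active+clean
    ceph -w            # or watch live until the PG states stop changing
    ceph df            # space used by the dropped pools is gradually freed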
>
>
>
>>
>> Br,
>> Tuomas
>>
>>
>> > Thanks man. I'll try it tomorrow. Have a good one.
>> >
>> > Br,T
>> >
>> > -------- Original message --------
>> > From: Sage Weil <sage@xxxxxxxxxxxx>
>> > Date: 30/04/2015  18:23  (GMT+02:00)
>> > To: Tuomas Juntunen <tuomas.juntunen@xxxxxxxxxxxxxxx>
>> > Cc: ceph-users@xxxxxxxxxxxxxx, ceph-devel@xxxxxxxxxxxxxxx
>> > Subject: RE:  Upgrade from Giant to Hammer and after some basic
>> > operations most of the OSD's went down
>> >
>> > On Thu, 30 Apr 2015, tuomas.juntunen@xxxxxxxxxxxxxxx wrote:
>> >> Hey
>> >>
>> >> Yes I can drop the images data, you think this will fix it?
>> >
>> > It's a slightly different assert that (I believe) should not 
>> > trigger once the pool is deleted.  Please give that a try and if 
>> > you still hit it I'll whip up a workaround.
>> >
>> > Thanks!
>> > sage
>> >
>> >  >
>> >>
>> >> Br,
>> >>
>> >> Tuomas
>> >>
>> >> > On Wed, 29 Apr 2015, Tuomas Juntunen wrote:
>> >> >> Hi
>> >> >>
>> >> >> I updated that version and it seems that something did happen:
>> >> >> the osd's stayed up for a while and 'ceph status' got updated.
>> >> >> But then, in a couple of minutes, they all went down the same way.
>> >> >>
>> >> >> I have attached a new 'ceph osd dump -f json-pretty' and got a new
>> >> >> log from one of the osd's with osd debug = 20,
>> >> >> http://beta.xaasbox.com/ceph/ceph-osd.15.log
>> >> >
>> >> > Sam mentioned that you had said earlier that this was not critical
>> >> > data?  If not, I think the simplest thing is to just drop those
>> >> > pools.  The important thing (from my perspective at least :) is that
>> >> > we understand the root cause and can prevent this in the future.
>> >> >
>> >> > sage
>> >> >
>> >> >
>> >> >>
>> >> >> Thank you!
>> >> >>
>> >> >> Br,
>> >> >> Tuomas
>> >> >>
>> >> >>
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>> >> >> Sent: 28 April 2015 23:57
>> >> >> To: Tuomas Juntunen
>> >> >> Cc: ceph-users@xxxxxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
>> >> >> Subject: Re:  Upgrade from Giant to Hammer and after some basic
>> >> >> operations most of the OSD's went down
>> >> >>
>> >> >> Hi Tuomas,
>> >> >>
>> >> >> I've pushed an updated wip-hammer-snaps branch.  Can you please
>> >> >> try it?  The build will appear here
>> >> >>
>> >> >> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/sha1/08bf531331afd5e2eb514067f72afda11bcde286
>> >> >>
>> >> >> (or a similar url; adjust for your distro).
>> >> >>
>> >> >> Thanks!
>> >> >> sage
>> >> >>
>> >> >>
>> >> >> On Tue, 28 Apr 2015, Sage Weil wrote:
>> >> >>
>> >> >> > [adding ceph-devel]
>> >> >> >
>> >> >> > Okay, I see the problem.  This seems to be unrelated to the
>> >> >> > giant -> hammer move... it's a result of the tiering changes you
>> >> >> > made:
>> >> >> >
>> >> >> > > > > > > > The following:
>> >> >> > > > > > > >
>> >> >> > > > > > > > ceph osd tier add img images --force-nonempty
>> >> >> > > > > > > > ceph osd tier cache-mode images forward
>> >> >> > > > > > > > ceph osd tier set-overlay img images
>> >> >> >
>> >> >> > Specifically, --force-nonempty bypassed important safety checks.
>> >> >> >
>> >> >> > 1. images had snapshots (and removed_snaps)
>> >> >> >
>> >> >> > 2. images was added as a tier *of* img, and img's 
>> >> >> > removed_snaps was copied to images, clobbering the 
>> >> >> > removed_snaps value (see
>> >> >> > OSDMap::Incremental::propagate_snaps_to_tiers)
>> >> >> >
>> >> >> > 3. tiering relation was undone, but removed_snaps was still 
>> >> >> > gone
>> >> >> >
>> >> >> > 4. on OSD startup, when we load the PG, removed_snaps is
>> >> >> > initialized with the older map.  later, in PGPool::update(),
>> >> >> > we assume that removed_snaps always grows (never shrinks) and we
>> >> >> > trigger an assert.
>> >> >> >
>> >> >> > To fix this I think we need to do 2 things:
>> >> >> >
>> >> >> > 1. make the OSD forgiving of removed_snaps getting smaller.
>> >> >> > This is probably a good thing anyway: once we know snaps are
>> >> >> > removed on all OSDs we can prune the interval_set in the
>> >> >> > OSDMap.  Maybe.
>> >> >> >
>> >> >> > 2. Fix the mon to prevent this from happening, *even* when 
>> >> >> > --force-nonempty is specified.  (This is the root cause.)
>> >> >> >
>> >> >> > I've opened http://tracker.ceph.com/issues/11493 to track this.
>> >> >> >
>> >> >> > sage
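
To make the sequence above concrete, the sketch below restates it as
commands. The tier-removal commands and the monitor's rejection without
--force-nonempty are inferred from the explanation above for illustration;
the thread itself only shows the three tier commands.

    # what was actually run: forcing a pool that already holds data and
    # snapshots (hence a removed_snaps set) in as a tier of img
    ceph osd tier add img images --force-nonempty
    ceph osd tier cache-mode images forward
    ceph osd tier set-overlay img images

    # without --force-nonempty the monitor would refuse to add a non-empty
    # pool as a tier -- that is the safety check being bypassed here
    ceph osd tier add img images          # rejected: 'images' is not empty

    # undoing the relation afterwards (roughly) does not restore the
    # clobbered removed_snaps, which is what later trips the OSD assert
    # in PGPool::update() / interval_set::subtract()
    ceph osd tier remove-overlay img
    ceph osd tier remove img images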
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Idea was to make images a tier of img, move data to
>> >> >> > > > > > > > img, then change clients to use the new img pool.
>> >> >> > > > > > > >
>> >> >> > > > > > > > Br,
>> >> >> > > > > > > > Tuomas
>> >> >> > > > > > > >
>> >> >> > > > > > > > > Can you explain exactly what you mean by:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > "Also I created one pool for tier to be able to
>> >> >> > > > > > > > > move data without outage."
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > -Sam
>> >> >> > > > > > > > > ----- Original Message -----
>> >> >> > > > > > > > > From: "tuomas juntunen"
>> >> >> > > > > > > > > <tuomas.juntunen@xxxxxxxxxxxxxxx>
>> >> >> > > > > > > > > To: "Ian Colle" <icolle@xxxxxxxxxx>
>> >> >> > > > > > > > > Cc: ceph-users@xxxxxxxxxxxxxx
>> >> >> > > > > > > > > Sent: Monday, April 27, 2015 4:23:44 AM
>> >> >> > > > > > > > > Subject: Re:  Upgrade from Giant to 
>> >> >> > > > > > > > > Hammer and after some basic operations most of 
>> >> >> > > > > > > > > the OSD's went down
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Hi
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Any solution for this yet?
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Br,
>> >> >> > > > > > > > > Tuomas
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >> It looks like you may have hit
>> >> >> > > > > > > > >> http://tracker.ceph.com/issues/7915
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Ian R. Colle
>> >> >> > > > > > > > >> Global Director of Software Engineering Red 
>> >> >> > > > > > > > >> Hat (Inktank is now part of Red Hat!) 
>> >> >> > > > > > > > >> http://www.linkedin.com/in/ircolle
>> >> >> > > > > > > > >> http://www.twitter.com/ircolle
>> >> >> > > > > > > > >> Cell: +1.303.601.7713
>> >> >> > > > > > > > >> Email: icolle@xxxxxxxxxx
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> ----- Original Message -----
>> >> >> > > > > > > > >> From: "tuomas juntunen"
>> >> >> > > > > > > > >> <tuomas.juntunen@xxxxxxxxxxxxxxx>
>> >> >> > > > > > > > >> To: ceph-users@xxxxxxxxxxxxxx
>> >> >> > > > > > > > >> Sent: Monday, April 27, 2015 1:56:29 PM
>> >> >> > > > > > > > >> Subject:  Upgrade from Giant to 
>> >> >> > > > > > > > >> Hammer and after some basic operations most of 
>> >> >> > > > > > > > >> the OSD's went down
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> I upgraded Ceph from 0.87 Giant to 0.94.1 Hammer.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Then I created new pools and deleted some old
>> >> >> > > > > > > > >> ones. Also I created one pool for tier to be able
>> >> >> > > > > > > > >> to move data without outage.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> After these operations all but 10 OSD's are down
>> >> >> > > > > > > > >> and producing this kind of messages to the logs;
>> >> >> > > > > > > > >> I get more than 100 GB of these in a night:
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>  -19> 2015-04-27 10:17:08.808584 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started
>> >> >> > > > > > > > >>  -18> 2015-04-27 10:17:08.808596 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Start
>> >> >> > > > > > > > >>  -17> 2015-04-27 10:17:08.808608 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] state<Start>: transitioning to Stray
>> >> >> > > > > > > > >>  -16> 2015-04-27 10:17:08.808621 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] exit Start 0.000025 0 0.000000
>> >> >> > > > > > > > >>  -15> 2015-04-27 10:17:08.808637 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[0.189( v 8480'7 (0'0,8480'7] local-les=16609 n=0 ec=1 les/c 16609/16659 16590/16590/16590) [24,3,23] r=2 lpr=17838 pi=15659-16589/42 crt=8480'7 lcod 0'0 inactive NOTIFY] enter Started/Stray
>> >> >> > > > > > > > >>  -14> 2015-04-27 10:17:08.808796 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Reset 0.119467 4 0.000037
>> >> >> > > > > > > > >>  -13> 2015-04-27 10:17:08.808817 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started
>> >> >> > > > > > > > >>  -12> 2015-04-27 10:17:08.808828 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Start
>> >> >> > > > > > > > >>  -11> 2015-04-27 10:17:08.808838 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] state<Start>: transitioning to Stray
>> >> >> > > > > > > > >>  -10> 2015-04-27 10:17:08.808849 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] exit Start 0.000020 0 0.000000
>> >> >> > > > > > > > >>   -9> 2015-04-27 10:17:08.808861 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[10.181( empty local-les=17879 n=0 ec=17863 les/c 17879/17879 17863/17863/17863) [25,5,23] r=2 lpr=17879 crt=0'0 inactive NOTIFY] enter Started/Stray
>> >> >> > > > > > > > >>   -8> 2015-04-27 10:17:08.809427 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Reset 7.511623 45 0.000165
>> >> >> > > > > > > > >>   -7> 2015-04-27 10:17:08.809445 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started
>> >> >> > > > > > > > >>   -6> 2015-04-27 10:17:08.809456 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Start
>> >> >> > > > > > > > >>   -5> 2015-04-27 10:17:08.809468 7fd8e748d700  1 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
>> >> >> > > > > > > > >>   -4> 2015-04-27 10:17:08.809479 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] exit Start 0.000023 0 0.000000
>> >> >> > > > > > > > >>   -3> 2015-04-27 10:17:08.809492 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
>> >> >> > > > > > > > >>   -2> 2015-04-27 10:17:08.809502 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
>> >> >> > > > > > > > >>   -1> 2015-04-27 10:17:08.809513 7fd8e748d700  5 osd.23 pg_epoch: 17882 pg[2.189( empty local-les=16127 n=0 ec=1 les/c 16127/16344 16125/16125/16125) [23,5] r=0 lpr=17838 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
>> >> >> > > > > > > > >>    0> 2015-04-27 10:17:08.813837 7fd8e748d700 -1 ./include/interval_set.h: In function 'void interval_set<T>::erase(T, T) [with T = snapid_t]' thread 7fd8e748d700 time 2015-04-27 10:17:08.809899
>> >> >> > > > > > > > >> ./include/interval_set.h: 385: FAILED assert(_size >= 0)
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>  ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
>> >> >> > > > > > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
>> >> >> > > > > > > > >>  2: (interval_set<snapid_t>::subtract(interval_set<snapid_t> const&)+0xb0) [0x82cd50]
>> >> >> > > > > > > > >>  3: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x52e) [0x80113e]
>> >> >> > > > > > > > >>  4: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>, std::tr1::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x282) [0x801652]
>> >> >> > > > > > > > >>  5: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2c3) [0x6b0e43]
>> >> >> > > > > > > > >>  6: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x21c) [0x6b191c]
>> >> >> > > > > > > > >>  7: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x18) [0x709278]
>> >> >> > > > > > > > >>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
>> >> >> > > > > > > > >>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
>> >> >> > > > > > > > >>  10: (()+0x8182) [0x7fd906946182]
>> >> >> > > > > > > > >>  11: (clone()+0x6d) [0x7fd904eb147d]
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Also by monitoring (ceph -w) I get the following
>> >> >> > > > > > > > >> messages, also lots of them:
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> 2015-04-27 10:39:52.935812 mon.0 [INF] from='client.? 10.20.0.13:0/1174409' entity='osd.30' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 30, "weight": 1.82}]: dispatch
>> >> >> > > > > > > > >> 2015-04-27 10:39:53.297376 mon.0 [INF] from='client.? 10.20.0.13:0/1174483' entity='osd.26' cmd=[{"prefix": "osd crush create-or-move", "args": ["host=ceph3", "root=default"], "id": 26, "weight": 1.82}]: dispatch
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> This is a cluster of 3 nodes with 36 OSD's; the
>> >> >> > > > > > > > >> nodes are also mons and mds's to save servers.
>> >> >> > > > > > > > >> All run Ubuntu 14.04.2.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> I have pretty much tried everything I could
>> >> >> > > > > > > > >> think of.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Restarting daemons doesn't help.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Any help would be appreciated. I can also provide
>> >> >> > > > > > > > >> more logs if necessary. They just seem to get
>> >> >> > > > > > > > >> pretty large in a few moments.
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >> Thank you
>> >> >> > > > > > > > >> Tuomas
>> >> >> > > > > > > > >>
>> >> >> > > > > > > > >>



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




