I'm getting the same issue with one of my OSDs.
Calculating dependencies... done!
[ebuild R ~] app-arch/snappy-1.1.0 USE="-static-libs" 0 kB
[ebuild R ~] dev-libs/leveldb-1.9.0-r5 USE="snappy -static-libs" 0 kB
[ebuild R ~] sys-cluster/ceph-0.60-r1 USE="-debug -fuse -gtk -libatomic -radosgw -static-libs -tcmalloc" 0 kB
Below is my log.
Thanks,
mr.npp
On Tue, Apr 30, 2013 at 9:17 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
On the OSD node:
root@cepha0:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.10
Release: 12.10
Codename: quantal
root@cepha0:~# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-======================================-========================-========================-==================================================================================
ii libleveldb1:armhf 0+20120530.gitdd0d562-2 armhf fast key-value storage library
root@cepha0:~# uname -a
Linux cepha0 3.5.0-27-highbank #46-Ubuntu SMP Mon Mar 25 23:19:40 UTC 2013 armv7l armv7l armv7l GNU/Linux
On the MON node:
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.10
Release: 12.10
Codename: quantal
# uname -a
Linux 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# dpkg -l "*leveldb*"
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-======================================-========================-========================-==================================================================================
un leveldb-doc <none> (no description available)
ii libleveldb-dev:amd64 0+20120530.gitdd0d562-2 amd64 fast key-value storage library (development files)
ii libleveldb1:amd64 0+20120530.gitdd0d562-2 amd64 fast key-value storage library
On Tue, Apr 30, 2013 at 12:11 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
What version of leveldb is installed? Ubuntu/version?
-Sam
On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Interestingly, the down OSD does not get marked out after 5 minutes.
> Probably that is already fixed by http://tracker.ceph.com/issues/4822.
>
>
> On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>
>> Hi Sam,
>>
>> I was prepared to write in and say that the problem had gone away. I
>> tried restarting several OSDs last night in the hopes of capturing the
>> problem on an OSD that hadn't failed yet, but didn't have any luck. So I
>> did indeed re-create the cluster from scratch (using mkcephfs), and what do
>> you know -- everything worked. I got everything in a nice stable state,
>> then decided to do a full cluster restart, just to be sure. Sure enough,
>> one OSD failed to come up, and has the same stack trace. So I believe I
>> have the log you want -- just from the OSD that failed, right?
>>
>> Question -- any feeling for what parts of the log you need? It's 688MB
>> uncompressed (two hours!), so I'd like to be able to trim some off for you
>> before making it available. Do you only need/want the part from after the
>> OSD was restarted? Or perhaps the corruption happens on OSD shutdown and
>> you need some before that? If you are fine with that large of a file, I can
>> just make that available too. Let me know.
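A rough sketch of one way to trim a log like that, assuming each line begins with a "YYYY-MM-DD HH:MM:SS" timestamp and the approximate restart time is known (the timestamp and file name below are placeholders, not taken from this thread):

# keep everything from the first line stamped at or after the restart onward
sed -n '/^2013-04-30 09:/,$p' ceph-osd.3.log | gzip > ceph-osd.3.trimmed.log.gz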
>>
>> - Travis
>>
>>
>> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>>
>>> Hi Sam,
>>>
>>> No problem, I'll leave that debugging turned up high, and do a mkcephfs
>>> from scratch and see what happens. Not sure if it will happen again or not.
>>> =)
>>>
>>> Thanks again.
>>>
>>> - Travis
>>>
>>>
>>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <sam.just@xxxxxxxxxxx>
>>> wrote:
>>>>
>>>> Hmm, I need logging from when the corruption happened. If this is
>>>> reproducible, can you enable that logging on a clean osd (or better, a
>>>> clean cluster) until the assert occurs?
>>>> -Sam
>>>>
>>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <trhoden@xxxxxxxxx>
>>>> wrote:
>>>> > Also, I can note that it does not take a full cluster restart to
>>>> > trigger
>>>> > this. If I just restart an OSD that was up/in previously, the same
>>>> > error
>>>> > can happen (though not every time). So restarting OSDs for me is a
>>>> > bit like Russian roulette. =) Even though restarting an OSD doesn't
>>>> > always result in the error, it seems that once it happens, that OSD
>>>> > is gone for good. No amount of restarting has brought any of the
>>>> > dead ones back.
>>>> >
>>>> > I'd really like to get to the bottom of it. Let me know if I can do
>>>> > anything to help.
>>>> >
>>>> > I may also have to try completely wiping/rebuilding to see if I can
>>>> > make
>>>> > this thing usable.
>>>> >
>>>> >
>>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <trhoden@xxxxxxxxx>
>>>> > wrote:
>>>> >>
>>>> >> Hi Sam,
>>>> >>
>>>> >> Thanks for being willing to take a look.
>>>> >>
>>>> >> I applied the debug settings on one host that has 3 out of 3 OSDs
>>>> >> with this problem. Then I tried to start them up. Here are the
>>>> >> resulting logs:
>>>> >>
>>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>>> >>
>>>> >> - Travis
>>>> >>
>>>> >>
>>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.just@xxxxxxxxxxx>
>>>> >> wrote:
>>>> >>>
>>>> >>> You appear to be missing pg metadata for some reason. If you can
>>>> >>> reproduce it with
>>>> >>> debug osd = 20
>>>> >>> debug filestore = 20
>>>> >>> debug ms = 1
>>>> >>> on all of the OSDs, I should be able to track it down.
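A minimal sketch of where those settings could go, assuming a stock ceph.conf layout (the section placement is an assumption here, not something specified in the thread):

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

The OSD daemons would then need to be restarted (e.g. with the "service ceph -a restart" used elsewhere in this thread) so the settings take effect before the assert is reproduced.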
>>>> >>>
>>>> >>> I created a bug: #4855.
>>>> >>>
>>>> >>> Thanks!
>>>> >>> -Sam
>>>> >>>
>>>> >>> On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <trhoden@xxxxxxxxx>
>>>> >>> wrote:
>>>> >>> > Thanks Greg.
>>>> >>> >
>>>> >>> > I quit playing with it because every time I restarted the cluster
>>>> >>> > (service
>>>> >>> > ceph -a restart), I lost more OSDs. The first time it was 1, the
>>>> >>> > 2nd time 10, the 3rd time 13... All 13 down OSDs show the same
>>>> >>> > stack trace.
>>>> >>> >
>>>> >>> > - Travis
>>>> >>> >
>>>> >>> >
>>>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum
>>>> >>> > <greg@xxxxxxxxxxx>
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> This sounds vaguely familiar to me, and I see
>>>> >>> >> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>>>> >>> >> reproduce" — I think maybe this is fixed in "next" and "master",
>>>> >>> >> but
>>>> >>> >> I'm not sure. For more than that I'd have to defer to Sage or
>>>> >>> >> Sam.
>>>> >>> >> -Greg
>>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden
>>>> >>> >> <trhoden@xxxxxxxxx>
>>>> >>> >> wrote:
>>>> >>> >> > Hey folks,
>>>> >>> >> >
>>>> >>> >> > I'm helping put together a new test/experimental cluster, and
>>>> >>> >> > hit
>>>> >>> >> > this
>>>> >>> >> > today
>>>> >>> >> > when bringing the cluster up for the first time (using
>>>> >>> >> > mkcephfs).
>>>> >>> >> >
>>>> >>> >> > After doing the normal "service ceph -a start", I noticed one
>>>> >>> >> > OSD
>>>> >>> >> > was
>>>> >>> >> > down,
>>>> >>> >> > and a lot of PGs were stuck creating. I tried restarting the
>>>> >>> >> > down OSD, but it wouldn't come up. It always had this error:
>>>> >>> >> >
>>>> >>> >> > -1> 2013-04-27 18:11:56.179804 b6fcd000 2 osd.1 0 boot
>>>> >>> >> > 0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In
>>>> >>> >> > function
>>>> >>> >> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t,
>>>> >>> >> > hobject_t&,
>>>> >>> >> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27
>>>> >>> >> > 18:11:56.399089
>>>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>>> >>> >> >
>>>> >>> >> > ceph version 0.60-401-g17a3859
>>>> >>> >> > (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>>>> >>> >> > 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>>>> >>> >> > ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>>>> >>> >> > 2: (OSD::load_pgs()+0x357) [0x28cba0]
>>>> >>> >> > 3: (OSD::init()+0x741) [0x290a16]
>>>> >>> >> > 4: (main()+0x1427) [0x2155c0]
>>>> >>> >> > 5: (__libc_start_main()+0x99) [0xb69bcf42]
>>>> >>> >> > NOTE: a copy of the executable, or `objdump -rdS <executable>`
>>>> >>> >> > is
>>>> >>> >> > needed to
>>>> >>> >> > interpret this.
>>>> >>> >> >
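A hedged example of the objdump step that NOTE refers to, assuming the OSD binary lives at /usr/bin/ceph-osd (a from-source build may install it elsewhere):

# dump disassembly interleaved with source and relocation info, so the addresses in the trace can be mapped back to code
objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump.txt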
>>>> >>> >> >
>>>> >>> >> > I then did a full cluster restart, and now I have ten OSDs down
>>>> >>> >> > --
>>>> >>> >> > each
>>>> >>> >> > showing the same exception/failed assert.
>>>> >>> >> >
>>>> >>> >> > Anybody seen this?
>>>> >>> >> >
>>>> >>> >> > I know I'm running a weird version -- it's compiled from
>>>> >>> >> > source, and
>>>> >>> >> > was
>>>> >>> >> > provided to me. The OSDs are all on ARM, and the mon is
>>>> >>> >> > x86_64.
>>>> >>> >> > Just
>>>> >>> >> > looking to see if anyone has seen this particular stack trace
>>>> >>> >> > of
>>>> >>> >> > load_pgs()/peek_map_epoch() before....
>>>> >>> >> >
>>>> >>> >> > - Travis
>>>> >>> >> >
>>>> >>> >> > _______________________________________________
>>>> >>> >> > ceph-users mailing list
>>>> >>> >> > ceph-users@xxxxxxxxxxxxxx
>>>> >>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >>> >> >
>>>> >>> >
>>>> >>> >
>>>> >>
>>>> >>
>>>> >
>>>
>>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com