Hi Sam,
Thanks for being willing to take a look.
I applied the debug settings on one host that has 3 out of 3 OSDs with this problem, then tried to start them up. Here are the resulting logs:
https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
- Travis
On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
You appear to be missing pg metadata for some reason. If you can
reproduce it with
debug osd = 20
debug filestore = 20
debug ms = 1
on all of the OSDs, I should be able to track it down.
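For reference, those settings normally go in the [osd] section of ceph.conf on each OSD host, roughly like this (a sketch; exact placement depends on how your config is laid out):

[osd]
  debug osd = 20
  debug filestore = 20
  debug ms = 1

Restart the affected OSDs after editing so the new log levels take effect.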
I created a bug: #4855.
Thanks!
-Sam
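For anyone else hitting this trace: the failing check is the assert(values.size() == 1) in PG::peek_map_epoch(), reached from OSD::load_pgs() during startup. Very roughly (a simplified sketch in plain C++, not the actual osd/PG.cc code; the "map_epoch" key name is made up for illustration), the OSD looks up the single stored value that records a PG's map epoch and expects exactly one result back. If that per-PG metadata is missing, nothing comes back and the assert aborts the OSD while it is loading its PGs:

#include <cassert>
#include <map>
#include <string>

using epoch_t = unsigned int;

// Simplified illustration of the pattern behind the failed assert.
epoch_t peek_map_epoch_sketch(const std::map<std::string, std::string> &stored_pg_meta)
{
    // Ask for the single key that records this PG's map epoch.
    std::map<std::string, std::string> values;
    auto it = stored_pg_meta.find("map_epoch");
    if (it != stored_pg_meta.end())
        values.insert(*it);

    // The check that fails: exactly one value is expected back.
    // With the metadata missing, values is empty and the OSD aborts.
    assert(values.size() == 1);
    return static_cast<epoch_t>(std::stoul(values.begin()->second));
}

That matches the symptom in the trace below: the OSD boots, OSD::init() calls load_pgs(), a PG with missing metadata is hit, and the daemon crashes the same way on every restart.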
On Mon, Apr 29, 2013 at 9:52 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Thanks Greg.
>
> I quit playing with it because every time I restarted the cluster (service
> ceph -a restart), I lost more OSDs: the first time 1, the second time 10, the
> third time 13. All 13 down OSDs show the same stack trace.
>
> - Travis
>
>
> On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> This sounds vaguely familiar to me, and I see
>> http://tracker.ceph.com/issues/4052, which is marked as "Can't
>> reproduce" — I think maybe this is fixed in "next" and "master", but
>> I'm not sure. For more than that I'd have to defer to Sage or Sam.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>> > Hey folks,
>> >
>> > I'm helping put together a new test/experimental cluster, and hit this
>> > today
>> > when bringing the cluster up for the first time (using mkcephfs).
>> >
>> > After doing the normal "service ceph -a start", I noticed one OSD was
>> > down,
>> > and a lot of PGs were stuck creating. I tried restarting the down OSD,
>> > but
>> > it would not come up. It always had this error:
>> >
>> > -1> 2013-04-27 18:11:56.179804 b6fcd000 2 osd.1 0 boot
>> > 0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function
>> > 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>> > ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>> >
>> > ceph version 0.60-401-g17a3859
>> > (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>> > 1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&,
>> > ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>> > 2: (OSD::load_pgs()+0x357) [0x28cba0]
>> > 3: (OSD::init()+0x741) [0x290a16]
>> > 4: (main()+0x1427) [0x2155c0]
>> > 5: (__libc_start_main()+0x99) [0xb69bcf42]
>> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> > needed to
>> > interpret this.
>> >
>> >
>> > I then did a full cluster restart, and now I have ten OSDs down -- each
>> > showing the same exception/failed assert.
>> >
>> > Anybody seen this?
>> >
>> > I know I'm running a weird version -- it's compiled from source, and was
>> > provided to me. The OSDs are all on ARM, and the mon is x86_64. Just
>> > looking to see if anyone has seen this particular stack trace of
>> > load_pgs()/peek_map_epoch() before....
>> >
>> > - Travis
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
>