What version of leveldb is installed? Ubuntu/version?
-Sam

On Tue, Apr 30, 2013 at 8:50 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Interestingly, the down OSD does not get marked out after 5 minutes. Probably that is already fixed by http://tracker.ceph.com/issues/4822.
>
> On Tue, Apr 30, 2013 at 11:42 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>
>> Hi Sam,
>>
>> I was prepared to write in and say that the problem had gone away. I tried restarting several OSDs last night in the hopes of capturing the problem on an OSD that hadn't failed yet, but didn't have any luck. So I did indeed re-create the cluster from scratch (using mkcephfs), and what do you know -- everything worked. I got everything in a nice stable state, then decided to do a full cluster restart, just to be sure. Sure enough, one OSD failed to come up, and has the same stack trace. So I believe I have the log you want -- just from the OSD that failed, right?
>>
>> Question -- any feeling for what parts of the log you need? It's 688MB uncompressed (two hours!), so I'd like to be able to trim some off for you before making it available. Do you only need/want the part from after the OSD was restarted? Or perhaps the corruption happens on OSD shutdown and you need some from before that? If you are fine with that large a file, I can just make that available too. Let me know.
>>
>> - Travis
>>
>> On Mon, Apr 29, 2013 at 6:26 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>>
>>> Hi Sam,
>>>
>>> No problem, I'll leave that debugging turned up high, do a mkcephfs from scratch, and see what happens. Not sure if it will happen again or not. =)
>>>
>>> Thanks again.
>>>
>>> - Travis
>>>
>>> On Mon, Apr 29, 2013 at 5:51 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>
>>>> Hmm, I need logging from when the corruption happened. If this is reproducible, can you enable that logging on a clean osd (or better, a clean cluster) until the assert occurs?
>>>> -Sam
>>>>
>>>> On Mon, Apr 29, 2013 at 2:45 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>>> > Also, I can note that it does not take a full cluster restart to trigger this. If I just restart an OSD that was up/in previously, the same error can happen (though not every time). So restarting OSDs for me is a bit like Russian roulette. =) Even though restarting an OSD may not always result in the error, it seems that once it happens, that OSD is gone for good. No amount of restarting has brought any of the dead ones back.
>>>> >
>>>> > I'd really like to get to the bottom of it. Let me know if I can do anything to help.
>>>> >
>>>> > I may also have to try completely wiping/rebuilding to see if I can make this thing usable.
>>>> >
>>>> > On Mon, Apr 29, 2013 at 2:38 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>>> >>
>>>> >> Hi Sam,
>>>> >>
>>>> >> Thanks for being willing to take a look.
>>>> >>
>>>> >> I applied the debug settings on one host that had 3 out of 3 OSDs with this problem, then tried to start them up. Here are the resulting logs:
>>>> >>
>>>> >> https://dl.dropboxusercontent.com/u/23122069/cephlogs.tgz
>>>> >>
>>>> >> - Travis
>>>> >>
>>>> >> On Mon, Apr 29, 2013 at 1:04 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>> >>>
>>>> >>> You appear to be missing pg metadata for some reason. If you can reproduce it with
>>>> >>> debug osd = 20
>>>> >>> debug filestore = 20
>>>> >>> debug ms = 1
>>>> >>> on all of the OSDs, I should be able to track it down.
>>>> >>>
>>>> >>> I created a bug: #4855.
>>>> >>>
>>>> >>> Thanks!
>>>> >>> -Sam
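
A minimal sketch of one way to put the logging Sam asks for in place, assuming the usual /etc/ceph/ceph.conf location and the same sysvinit restart command already used in this thread; because the assert fires during OSD startup (load_pgs()/peek_map_epoch()), the settings need to be in the conf file so they are active both while the corruption develops and at the next OSD start:

    # On every OSD host, add the settings to the [osd] section of
    # /etc/ceph/ceph.conf (path assumed; adjust to the actual install):
    #
    #   [osd]
    #       debug osd = 20
    #       debug filestore = 20
    #       debug ms = 1
    #
    # Then restart so the logging covers both the running daemons and the
    # startup path where the assert is hit:
    service ceph -a restart
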
>>>> >>> > Thanks Greg.
>>>> >>> >
>>>> >>> > I quit playing with it because every time I restarted the cluster (service ceph -a restart), I lost more OSDs. The first time it was 1, the second time 10, the third time 13... All 13 down OSDs show the same stack trace.
>>>> >>> >
>>>> >>> > - Travis
>>>> >>> >
>>>> >>> > On Mon, Apr 29, 2013 at 11:56 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>> >>> >>
>>>> >>> >> This sounds vaguely familiar to me, and I see http://tracker.ceph.com/issues/4052, which is marked as "Can't reproduce" — I think maybe this is fixed in "next" and "master", but I'm not sure. For more than that I'd have to defer to Sage or Sam.
>>>> >>> >> -Greg
>>>> >>> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>> >>> >>
>>>> >>> >> On Sat, Apr 27, 2013 at 6:43 PM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>>>> >>> >> > Hey folks,
>>>> >>> >> >
>>>> >>> >> > I'm helping put together a new test/experimental cluster, and hit this today when bringing the cluster up for the first time (using mkcephfs).
>>>> >>> >> >
>>>> >>> >> > After doing the normal "service ceph -a start", I noticed one OSD was down, and a lot of PGs were stuck creating. I tried restarting the down OSD, but it wouldn't come up. It always had this error:
>>>> >>> >> >
>>>> >>> >> >     -1> 2013-04-27 18:11:56.179804 b6fcd000  2 osd.1 0 boot
>>>> >>> >> >      0> 2013-04-27 18:11:56.402161 b6fcd000 -1 osd/PG.cc: In function 'static epoch_t PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::bufferlist*)' thread b6fcd000 time 2013-04-27 18:11:56.399089
>>>> >>> >> > osd/PG.cc: 2556: FAILED assert(values.size() == 1)
>>>> >>> >> >
>>>> >>> >> >  ceph version 0.60-401-g17a3859 (17a38593d60f5f29b9b66c13c0aaa759762c6d04)
>>>> >>> >> >  1: (PG::peek_map_epoch(ObjectStore*, coll_t, hobject_t&, ceph::buffer::list*)+0x1ad) [0x2c3c0a]
>>>> >>> >> >  2: (OSD::load_pgs()+0x357) [0x28cba0]
>>>> >>> >> >  3: (OSD::init()+0x741) [0x290a16]
>>>> >>> >> >  4: (main()+0x1427) [0x2155c0]
>>>> >>> >> >  5: (__libc_start_main()+0x99) [0xb69bcf42]
>>>> >>> >> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>> >>> >> >
>>>> >>> >> > I then did a full cluster restart, and now I have ten OSDs down -- each showing the same exception/failed assert.
>>>> >>> >> >
>>>> >>> >> > Anybody seen this?
>>>> >>> >> >
>>>> >>> >> > I know I'm running a weird version -- it's compiled from source, and was provided to me. The OSDs are all on ARM, and the mon is x86_64. Just looking to see if anyone has seen this particular stack trace of load_pgs()/peek_map_epoch() before....
>>>> >>> >> >
>>>> >>> >> > - Travis
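
On the question upthread about trimming the 688MB OSD log before sharing it: if only the portion from the restart onward turns out to be needed, a rough sketch of one way to cut it down is below. The log path is the assumed default and the timestamp is a placeholder; substitute the real OSD id and the actual restart time.

    # Assumed default log location and a hypothetical restart timestamp.
    LOG=/var/log/ceph/ceph-osd.1.log
    RESTART='2013-04-30 09:00'

    # Print everything from the first line containing the timestamp onward,
    # then compress the result for upload.
    awk -v ts="$RESTART" 'found || index($0, ts) { found = 1; print }' "$LOG" \
        | gzip > ceph-osd.1.trimmed.log.gz
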
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com