On Tue, 7 Sep 2010, Leander Yu wrote:
> Hi Sage,
> We had 3 OSDs fail first. Two of them have the same log as osd3;
> the other one is osd7.
> We just use it as a file system; I wrote a script that repeatedly writes
> a 10G file and then deletes it.
> The only unusual thing I saw in the log was that the journal file was
> full before the assertion trace.
>
> The other thing is that when I tried to restart the failed osds, more
> osds crashed on other nodes. However, we didn't get the core dump :(
> Thanks.

Are you able to reproduce the osd3 crash (FAILED
assert(caller_ops.count(e.reqid) == 0)) by restarting osds?  If so, can
you do so after adding

	debug ms = 1
	debug osd = 20
	debug filestore = 20

to the [osd] section of your ceph.conf, and sending the osd log somewhere
(via URL, private email, whatever)?

Thanks!
sage

>
> Regards,
> Leander Yu.
>
>
> On Tue, Sep 7, 2010 at 2:42 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > This is one we've seen before, issue #326:
> >
> >   http://tracker.newdream.net/issues/326
> >
> > Was that the first (and only?) osd to fail?
> >
> > What kind of workload were you subjecting the cluster to?  Just the
> > file system?  RBD?  Anything unusual?
> >
> > Also, can you confirm which version of the code you were running?  The
> > osd log at /var/log/ceph/osd.*.log should have a version number and
> > sha1 id, something like
> >
> >   ceph version 0.22~rc (3cd9d853cd58c79dc12427be8488e57970abda04)
> >
> > Thanks!
> > sage
> >
> >
> > On Mon, 6 Sep 2010, Leander Yu wrote:
> >
> >> Hi all,
> >> I have set up a Ceph cluster with 10 OSDs, 2 MDSes, and 3 monitors.
> >> It ran fine at first, but after a day some of the OSDs crashed with
> >> the following failed assertions.
> >> I am using the unstable trunk; ceph.conf is attached.
> >>
> >> -------------- osd 3 -----------------
> >> osd/PG.h: In function 'void PG::IndexedLog::index(PG::Log::Entry&)':
> >> osd/PG.h:429: FAILED assert(caller_ops.count(e.reqid) == 0)
> >>  1: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&,
> >> PG::Missing&, std::map<int, MOSDPGInfo*, std::less<int>,
> >> std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xb06)
> >> [0x4cf426]
> >>  2: (OSD::handle_pg_log(MOSDPGLog*)+0xa9) [0x4cf999]
> >>  3: (OSD::_dispatch(Message*)+0x3ed) [0x4e7dfd]
> >>  4: (OSD::ms_dispatch(Message*)+0x39) [0x4e86c9]
> >>  5: (SimpleMessenger::dispatch_entry()+0x789) [0x46b5f9]
> >>  6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45849c]
> >>  7: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
> >>  8: (()+0x6a3a) [0x7f69fd39ea3a]
> >>  9: (clone()+0x6d) [0x7f69fc5bc77d]
> >>
> >> -------------- osd 7 --------------------
> >> osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_pull(MOSDSubOp*)':
> >> osd/ReplicatedPG.cc:3021: FAILED assert(r == 0)
> >>  1: (OSD::dequeue_op(PG*)+0x344) [0x4e6fd4]
> >>  2: (ThreadPool::worker()+0x28f) [0x5b5a9f]
> >>  3: (ThreadPool::WorkThread::entry()+0xd) [0x4f0acd]
> >>  4: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
> >>  5: (()+0x6a3a) [0x7efff4f12a3a]
> >>  6: (clone()+0x6d) [0x7efff413077d]
> >>
> >> Please let me know if you need more information. I have kept the
> >> environment around so I can collect more data for debugging.
> >>
> >> Thanks.
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
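
For reference, the debug settings suggested above would sit in ceph.conf
roughly as in the sketch below. Only the three debug lines come from the
thread; the comment and any surrounding layout are illustrative, and the
rest of the [osd] section (paths, hosts, etc.) from the poster's attached
ceph.conf would remain unchanged.

```ini
[osd]
        ; verbose logging for reproducing the crash, per Sage's suggestion
        debug ms = 1
        debug osd = 20
        debug filestore = 20
```

With these levels set, restarting the failed osds should produce a much
more detailed /var/log/ceph/osd.*.log to attach to the bug report.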