On Fri, 5 Jun 2015, Srikanth Madugundi wrote:
> Hi Sage,
>
> Did you get a chance to look at the crash?

Not yet--I am still focusing on getting wip-temp (and other newstore
prerequisite code) working before turning back to newstore. I'll look at
this once I get back to newstore... hopefully in the next week or so!

sage

> Regards
> Srikanth
>
> On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi
> <srikanth.madugundi@xxxxxxxxx> wrote:
> > Hi Sage,
> >
> > I saw the crash again. Here is the output after adding the debug
> > message from wip-newstore-debuglist:
> >
> >  -31> 2015-06-03 20:28:18.864496 7fd95976b700 -1
> > newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
> > --.7fffffffffffffff.00000000.!!!0000000000000000.0000000000000000
> >
> > Here is the id of the file I posted.
> >
> > ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804
> >
> > Let me know if you need anything else.
> >
> > Regards
> > Srikanth
> >
> > On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
> > <srikanth.madugundi@xxxxxxxxx> wrote:
> >> Hi Sage,
> >>
> >> Unfortunately I purged the cluster yesterday and restarted the
> >> backfill tool. I have not seen the osd crash yet on the cluster. I am
> >> monitoring the OSDs and will update you once I see the crash.
> >>
> >> With the new backfill run I have reduced the rps by half; not sure if
> >> this is the reason for not seeing the crash yet.
> >>
> >> Regards
> >> Srikanth
> >>
> >> On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>> I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
> >>> with that branch with 'debug newstore = 20' and send us the log?
> >>> (You can just do 'ceph-post-file <filename>'.)
> >>>
> >>> Thanks!
> >>> sage
> >>>
> >>> On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> >>>
> >>>> Hi Sage,
> >>>>
> >>>> The assertion failed at line 1639; here is the log message:
> >>>>
> >>>> 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
> >>>> function 'virtual int NewStore::collection_list_partial(coll_t,
> >>>> ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
> >>>> ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
> >>>>
> >>>> os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)
> >>>>
> >>>> Just before the crash, here are the debug statements printed by the
> >>>> method (collection_list_partial):
> >>>>
> >>>> 2015-05-30 22:49:23.607232 7f1681934700 15
> >>>> newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
> >>>> start -1/0//0/0 min/max 1024/1024 snap head
> >>>> 2015-05-30 22:49:23.607251 7f1681934700 20
> >>>> newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
> >>>> --.7fffffffffffffb4.00000000. to --.7fffffffffffffb4.08000000. and
> >>>> --.800000000000004b.00000000. to --.800000000000004b.08000000. start
> >>>> -1/0//0/0
> >>>>
> >>>> Regards
> >>>> Srikanth
> >>>>
> >>>> On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >>>> > On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
> >>>> >> Hi Sage and all,
> >>>> >>
> >>>> >> I built the ceph code from wip-newstore on RHEL7 and am running performance
> >>>> >> tests to compare it with filestore. After a few hours of running the tests
> >>>> >> the osd daemons started to crash. Here is the stack trace; the osd
> >>>> >> crashes immediately after a restart, so I could not get the osd up
> >>>> >> and running.
> >>>> >>
> >>>> >> ceph version eb8e22893f44979613738dfcdd40dada2b513118
> >>>> >> (eb8e22893f44979613738dfcdd40dada2b513118)
> >>>> >> 1: /usr/bin/ceph-osd() [0xb84652]
> >>>> >> 2: (()+0xf130) [0x7f915f84f130]
> >>>> >> 3: (gsignal()+0x39) [0x7f915e2695c9]
> >>>> >> 4: (abort()+0x148) [0x7f915e26acd8]
> >>>> >> 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
> >>>> >> 6: (()+0x5e946) [0x7f915eb6b946]
> >>>> >> 7: (()+0x5e973) [0x7f915eb6b973]
> >>>> >> 8: (()+0x5eb9f) [0x7f915eb6bb9f]
> >>>> >> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> >>>> >> const*)+0x27a) [0xc84c5a]
> >>>> >> 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
> >>>> >> snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
> >>>> >> ghobject_t*)+0x13c9) [0xa08639]
> >>>> >> 11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
> >>>> >> snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
> >>>> >> hobject_t*)+0x352) [0x918a02]
> >>>> >> 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
> >>>> >> 13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x1eb) [0x8cd06b]
> >>>> >> 14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
> >>>> >> ThreadPool::TPHandle&)+0x68a) [0x85dbea]
> >>>> >> 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> >>>> >> std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
> >>>> >> [0x6c3f5d]
> >>>> >> 16: (OSD::ShardedOpWQ::_process(unsigned int,
> >>>> >> ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
> >>>> >> 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
> >>>> >> 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
> >>>> >> 19: (()+0x7df3) [0x7f915f847df3]
> >>>> >> 20: (clone()+0x6d) [0x7f915e32a01d]
> >>>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >>>> >> needed to interpret this.
> >>>> >>
> >>>> >> Please let me know the cause of this crash. When the crash happens I
> >>>> >> notice that two osds on separate machines are down. I can bring one
> >>>> >> osd up, but restarting the other osd causes both OSDs to crash. My
> >>>> >> understanding is that the crash happens when the two OSDs try to
> >>>> >> communicate and replicate a particular PG.
> >>>> >
> >>>> > Can you include the log lines that precede the dump above? In particular,
> >>>> > there should be a line that tells you what assertion failed in what
> >>>> > function and at what line number. I haven't seen this crash so I'm not
> >>>> > sure offhand what it is.
> >>>> >
> >>>> > Thanks!
> >>>> > sage
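
For reference, the check that fails at os/newstore/NewStore.cc:1639 asserts that the
key under iteration lies inside a half-open key range: assert(k >= start_key && k < end_key).
The standalone sketch below is an illustration only, not the NewStore code; the helper
name in_range is invented, and the hard-coded strings are copied from the debug lines
quoted above. It shows that the key printed in the Jun 3 output sorts outside both
ranges that collection_list_partial logged for 75.0_head, which is the condition that
would trip the assert.

    // Illustration only: replays the string comparison behind
    // "FAILED assert(k >= start_key && k < end_key)" with the key
    // ranges and the key taken from the debug output quoted above.
    #include <iostream>
    #include <string>

    // True when k lies in the half-open interval [start_key, end_key).
    static bool in_range(const std::string& k,
                         const std::string& start_key,
                         const std::string& end_key) {
      return k >= start_key && k < end_key;
    }

    int main() {
      // Ranges logged by collection_list_partial for 75.0_head.
      const std::string r1_start = "--.7fffffffffffffb4.00000000.";
      const std::string r1_end   = "--.7fffffffffffffb4.08000000.";
      const std::string r2_start = "--.800000000000004b.00000000.";
      const std::string r2_end   = "--.800000000000004b.08000000.";

      // Key logged just before the crash (for start -1/0//0/0).
      const std::string k =
        "--.7fffffffffffffff.00000000.!!!0000000000000000.0000000000000000";

      // Prints 0 and 0: k sorts after range 1 and before range 2,
      // so an assertion that k lies in either range fires.
      std::cout << in_range(k, r1_start, r1_end) << "\n";
      std::cout << in_range(k, r2_start, r2_end) << "\n";
      return 0;
    }

If that reading is right, the start value -1/0//0/0 appears to encode pool -1
(hence the 7fffffffffffffff component of k), while both logged ranges are derived
from pool 75 (0x4b), which would explain why k falls outside them.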