Re: OSDs crashing in EC pool (whack-a-mole)

For the record: in the linked tracker issue it was thought that this might
be due to write caching. That doesn't seem to be the case, as it happened
to me again with write caching disabled.
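(In case it's useful to anyone else: I checked and toggled the volatile
write cache on the data drives with hdparm; /dev/sdX below is just a
placeholder for each OSD's device, and SAS drives may need sdparm
instead.)

---
# show the current write-cache setting
hdparm -W /dev/sdX

# turn the volatile write cache off (may not persist across reboots)
hdparm -W 0 /dev/sdX
---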

On Tue, Jan 8, 2019 at 11:15 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> I've seen this on luminous, but not on mimic.  Can you generate a log with
> debug osd = 20 leading up to the crash?
>
> Thanks!
> sage
>
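In case it helps anyone else capture such a log: since the OSDs here seem
to die during peering shortly after starting, setting the debug level in
ceph.conf on the OSD host before restarting the daemon is probably safer
than injecting it at runtime. osd.43 is just the OSD from the systemd line
in the trace below; adjust to whichever OSD is crashing.

---
# in /etc/ceph/ceph.conf on the OSD host
[osd.43]
    debug osd = 20/20

# then restart the OSD and wait for the next crash
systemctl restart ceph-osd@43

# the detailed log ends up in the OSD's log file by default
less /var/log/ceph/ceph-osd.43.log
---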
>
> On Tue, 8 Jan 2019, Paul Emmerich wrote:
>
> > I've seen this a few times before, but unfortunately there doesn't seem
> > to be a good solution at the moment :(
> >
> > See also: http://tracker.ceph.com/issues/23145
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Tue, Jan 8, 2019 at 9:37 AM David Young <funkypenguin@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi all,
> > >
> > > One of my OSD hosts recently ran into RAM contention (it was swapping heavily), and since rebooting it I've been seeing this error on random OSDs across the cluster:
> > >
> > > ---
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  1: /usr/bin/ceph-osd() [0xcac700]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  2: (()+0x11390) [0x7f8fa5d0e390]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  3: (gsignal()+0x38) [0x7f8fa5241428]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  4: (abort()+0x16a) [0x7f8fa524302a]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f8fa767c510]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  6: (()+0x2e5587) [0x7f8fa767c587]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x923) [0xbab5e3]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  8: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5c3) [0xbade03]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  9: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x82) [0x79c812]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  10: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x730ff8]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xfe) [0x759aae]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x9c5720]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x769760]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7f8fa76824f6]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  16: (()+0x76ba) [0x7f8fa5d046ba]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  17: (clone()+0x6d) [0x7f8fa531341d]
> > > Jan 08 03:34:36 prod1 ceph-osd[3357939]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > Jan 08 03:34:36 prod1 systemd[1]: ceph-osd@43.service: Main process exited, code=killed, status=6/ABRT
> > > ---
> > >
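(Side note in case it helps with a trace like the one above: with the
matching ceph-osd binary and its debug symbols installed (e.g. the
ceph-osd-dbg package on Ubuntu), the raw addresses can usually be resolved
to function/file/line, along these lines:)

---
# resolve the BlueStore frame from the trace (address taken from frame 7)
addr2line -e /usr/bin/ceph-osd -fCi 0xbab5e3

# or dump the disassembly with source interleaved, as the log note suggests
objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump
---
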
> > > I've restarted all the OSDs and the mons, but I'm still encountering the above.
> > >
> > > Any ideas / suggestions?
> > >
> > > Thanks!
> > > D
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



