At the risk of hijacking this thread: as I said, I've run into this problem again, and I have captured a log with debug_osd=20, viewable at https://www.dropbox.com/s/8zoos5hhvakcpc4/ceph-osd.3.log?dl=0 - any pointers?

On Tue, Jan 8, 2019 at 11:31 AM Peter Woodman <peter@xxxxxxxxxxxx> wrote:
>
> For the record, in the linked issue, it was thought that this might be
> due to write caching. This seems not to be the case, as it happened
> again to me with write caching disabled.
>
> On Tue, Jan 8, 2019 at 11:15 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> >
> > I've seen this on luminous, but not on mimic. Can you generate a log with
> > debug osd = 20 leading up to the crash?
> >
> > Thanks!
> > sage
> >
> >
> > On Tue, 8 Jan 2019, Paul Emmerich wrote:
> >
> > > I've seen this before a few times, but unfortunately there doesn't seem
> > > to be a good solution at the moment :(
> > >
> > > See also: http://tracker.ceph.com/issues/23145
> > >
> > > Paul
> > >
> > > --
> > > Paul Emmerich
> > >
> > > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > >
> > > croit GmbH
> > > Freseniusstr. 31h
> > > 81247 München
> > > www.croit.io
> > > Tel: +49 89 1896585 90
> > >
> > > On Tue, Jan 8, 2019 at 9:37 AM David Young <funkypenguin@xxxxxxxxxxxxxx> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > One of my OSD hosts recently ran into RAM contention (it was swapping heavily), and after rebooting it, I'm seeing this error on random OSDs in the cluster:
> > > >
> > > > ---
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 1: /usr/bin/ceph-osd() [0xcac700]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 2: (()+0x11390) [0x7f8fa5d0e390]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 3: (gsignal()+0x38) [0x7f8fa5241428]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 4: (abort()+0x16a) [0x7f8fa524302a]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f8fa767c510]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 6: (()+0x2e5587) [0x7f8fa767c587]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x923) [0xbab5e3]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 8: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5c3) [0xbade03]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 9: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x82) [0x79c812]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 10: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x730ff8]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xfe) [0x759aae]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x9c5720]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x769760]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7f8fa76824f6]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 16: (()+0x76ba) [0x7f8fa5d046ba]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 17: (clone()+0x6d) [0x7f8fa531341d]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > Jan 08 03:34:36 prod1 systemd[1]: ceph-osd@43.service: Main process exited, code=killed, status=6/ABRT
> > > > ---
> > > >
> > > > I've restarted all the OSDs and the mons, but I'm still encountering the above.
> > > >
> > > > Any ideas / suggestions?
> > > >
> > > > Thanks!
> > > > D
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com