On Sat, Feb 11, 2017 at 2:51 PM, Ashley Merrick <ashley@xxxxxxxxxxxxxx> wrote:
> Hello,
>
> I have a particular OSD (53), which at random will crash with the OSD
> process stopping.
>
> OS: Debian 8.x
>
> CEPH : ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>
> From the logs at the time of the OSD being marked as crashed I can only see
> the following:
>
>     -4> 2017-02-10 23:40:16.820894 7fadbd049700 1 -- 172.16.3.7:6825/16969 <== osd.26 172.16.2.104:0/5812 1 ==== osd_ping(ping e29842 stamp 2017-02$
>     -3> 2017-02-10 23:40:16.820918 7fadbd049700 1 -- 172.16.3.7:6825/16969 --> 172.16.2.104:0/5812 -- osd_ping(ping_reply e29842 stamp 2017-02-10 2$
>     -2> 2017-02-10 23:40:16.822436 7faddb149700 1 -- 172.16.2.107:6820/16969 <== client.8222771 172.16.2.2:0/1125091221 86 ==== osd_op(client.82227$
>     -1> 2017-02-10 23:40:16.822453 7faddb149700 5 -- op tracker -- seq: 670, time: 2017-02-10 23:40:16.822453, event: queued_for_pg, op: osd_op(cli$
>      0> 2017-02-10 23:40:16.832241 7fadd0631700 -1 *** Caught signal (Aborted) **
>  in thread 7fadd0631700 thread_name:tp_osd_tp
>
>  ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>  1: (()+0x951cc7) [0x5556d8c4bcc7]
>  2: (()+0xf890) [0x7fadf5f8e890]
>  3: (gsignal()+0x37) [0x7fadf3fd5067]
>  4: (abort()+0x148) [0x7fadf3fd6448]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x5556d8d51296]
>  6: (FileStore::read(coll_t const&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xd7c) [0x5556d89e68ec]
>  7: (ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned long, unsigned long, unsigned int, ceph::buffer::list*)+0xcd) [0x5556d885ce7d]
>  8: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x6355) [0x5556d87f6515]
>  9: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x61) [0x5556d8802101]
>  10: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x936) [0x5556d880a566]
>  11: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x37c3) [0x5556d880f3d3]
>  12: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x727) [0x5556d87c6ae7]
>  13: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x420) [0x5556d866b650]
>  14: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6a) [0x5556d866b8aa]
>  15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x87a) [0x5556d8687f7a]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8b6) [0x5556d8d40c56]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5556d8d42c10]
>  18: (()+0x8064) [0x7fadf5f87064]
>  19: (clone()+0x6d) [0x7fadf408862d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
> Does this relate to anything or do I need to dig deeper to find the issue?

It's likely a filesystem or hardware problem, as the OSD is failing an
assert in FileStore::read. Could you thoroughly check the filesystem and
the underlying hardware?

You may be able to get more information about the specifics of the issue
by capturing a log with debugging turned right up (20).

> ,Ashley
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Cheers,
Brad
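
For illustration only, the suggestion to capture a log with debugging turned up to 20 could be sketched as below. The script only builds and prints a `ceph tell ... injectargs` command line for the affected OSD; it does not run it. The helper name `build_debug_command` is hypothetical, and the choice of the `osd` and `filestore` debug subsystems is an assumption based on the assert being in FileStore::read. The thread does not say which subsystems to raise.

```python
import shlex


def build_debug_command(osd_id, subsystems=("osd", "filestore"), level=20):
    """Build the argv for raising debug levels on one running OSD.

    Hypothetical helper: the subsystem names are an assumption, chosen
    because the backtrace ends in FileStore::read.
    """
    if not 0 <= level <= 20:
        raise ValueError("debug level is expected to be between 0 and 20")
    # injectargs takes the options as a single string argument.
    options = " ".join(f"--debug-{name} {level}" for name in subsystems)
    return ["ceph", "tell", f"osd.{osd_id}", "injectargs", options]


# OSD 53 is the one reported as crashing in the original message.
cmd = build_debug_command(53)
print(shlex.join(cmd))
```

Settings injected this way apply only to the running process, so they would need to be applied again after the OSD restarts following a crash. Level 20 logging is very verbose, so it is worth lowering the levels again once a crash has been captured.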