On Wed, Oct 18, 2017 at 11:16 PM, pascal.pucci@xxxxxxxxxxxxxxx
<pascal.pucci@xxxxxxxxxxxxxxx> wrote:
> Hello,
>
> For the past two weeks I have occasionally been losing OSDs.
> Here is the trace:
>
>     0> 2017-10-18 05:16:40.873511 7f7c1e497700 -1 osd/ReplicatedPG.cc: In
> function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
> unsigned int)' thread 7f7c1e497700 time 2017-10-18 05:16:40.869962
> osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

As per http://tracker.ceph.com/issues/19185, can you try to capture a log
with debug_osd set to 10 or greater? (Example commands for raising the
debug level are sketched at the end of this mail.) That will let us see
the output from the PrimaryLogPG::get_object_context() function, which
may help identify the problem.

Please also check that all of your machines have the same time zone set
and that their clocks are in sync (again, example commands below).

> ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x55eec15a09e5]
>  2: (ReplicatedPG::hit_set_trim(std::unique_ptr<ReplicatedPG::OpContext,
> std::default_delete<ReplicatedPG::OpContext> >&, unsigned int)+0x6dd)
> [0x55eec107a52d]
>  3: (ReplicatedPG::hit_set_persist()+0xd7c) [0x55eec107d1bc]
>  4: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x1a92)
> [0x55eec109bbe2]
>  5: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x747) [0x55eec10588a7]
>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>,
> ThreadPool::TPHandle&)+0x41d) [0x55eec0f0bbad]
>  7: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d)
> [0x55eec0f0bdfd]
>  8: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x77b) [0x55eec0f0f7db]
>  9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887)
> [0x55eec1590987]
>  10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55eec15928f0]
>  11: (()+0x7e25) [0x7f7c4fd52e25]
>  12: (clone()+0x6d) [0x7f7c4e3dc34d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> I am using Jewel 10.2.10.
>
> The setup is an erasure-coded pool (2+1) behind an NVMe cache tier
> (writeback mode, 3x replicated), serving simple RBD disks.
> (12 SATA OSD disks on each of 4 nodes + 1 NVMe per node = 48 SATA OSDs +
> 8 NVMe OSDs, since I split each NVMe in two.)
> Last week it was only NVMe OSDs that crashed, so I unmapped all the
> disks, destroyed the cache tier and recreated it.
> Since then it had been working fine, but today another OSD crashed, and
> this time it was not an NVMe OSD but a normal SATA OSD.
>
> Any idea? What is this ReplicatedPG::hit_set_trim() function about?
>
> Thanks for your help.
>
> Regards,

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
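
For reference, a minimal sketch of raising the OSD debug level on a Jewel
cluster; "osd.12" is just a placeholder for whichever OSD is asserting,
and the persistent ceph.conf change assumes a default installation:

  # Raise OSD logging at runtime from any admin node (placeholder OSD id):
  ceph tell osd.12 injectargs '--debug-osd 10/10'

  # Or on the node hosting that OSD, via its admin socket:
  ceph daemon osd.12 config set debug_osd 10/10

  # To keep the setting across OSD restarts, add it to the [osd] section
  # of ceph.conf:
  #   debug osd = 10/10

  # When the crash has been captured, drop back to the default level:
  ceph tell osd.12 injectargs '--debug-osd 0/5'

The "10/10" form sets both the log level and the in-memory log level; the
resulting log ends up in the usual /var/log/ceph/ceph-osd.<id>.log, which
is what the tracker issue asks to be attached.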
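
And a quick way to check the time zone and clock sync on each node,
assuming systemd hosts running ntpd (substitute "chronyc sources" if the
nodes use chrony instead):

  # Show the configured time zone and whether NTP synchronisation is active:
  timedatectl status

  # Confirm ntpd is actually locked to a peer; the '*' marks the selected
  # time source, and the offset column should be small (a few ms):
  ntpq -p

Run these on every node and compare; a node with a different time zone or a
large offset would be the first thing to fix before chasing the assert.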