OSDs crash after deleting unfound object in Luminous 12.2.8

Hi all,

We are running a Luminous 12.2.8 cluster with a three-way replicated cache pool (min_size = 2). We recently hit an "object unfound" error in one of the PGs of this pool. After marking the object lost, two of the acting OSDs crashed and could not be started again; only the primary OSD stayed up. Hoping the cluster would remap the copies of this PG, we marked the two crashed OSDs out. Since then the primary OSD of this PG has also gone down, again leaving only one active OSD, and the cluster is reporting a degraded filesystem. All of the affected OSDs run FileStore, while roughly half of the cluster has already been upgraded to BlueStore OSDs.

All three crashed OSDs fail to restart, reporting the following error during startup:

Oct 12 13:19:12 kaa-109 ceph-osd[166266]:      0> 2018-10-12 13:19:12.782652 7f1f2d79b700 -1 /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 12985: FAILED assert(obc)

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x109) [0x562265bfda9c]
 2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942) [0x5622657d6cea]
 3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
 4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a) [0x562265800c64]
 5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396) [0x562265655cf8]
 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x5622658c09a6]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd) [0x562265c026f5]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
 11: (()+0x751e) [0x7f1f4fd7f51e]
 12: (clone()+0x3f) [0x7f1f4ed7ef0f]

A search of the bug tracker turned up a similar error that was resolved for Jewel (http://tracker.ceph.com/issues/19223), but I don't know whether it is relevant here.
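For reference, the failing assertion is in PrimaryLogPG::hit_set_trim(). Below is a paraphrased sketch of how I read that code path in the 12.2.8 source, not a verbatim copy of src/osd/PrimaryLogPG.cc: while persisting a new hit set, the OSD trims the oldest hit-set archive objects, looks up the object context for each one with get_object_context(), and asserts that the lookup succeeds, so if one of those archive objects no longer exists the assert fires every time the PG tries to persist a hit set.

  // Paraphrased sketch of PrimaryLogPG::hit_set_trim() as I read the 12.2.8
  // source; the function names come from the backtrace, details may differ.
  void PrimaryLogPG::hit_set_trim(OpContextUPtr &ctx, unsigned max)
  {
    pg_hit_set_history_t &hist = *(ctx->updated_hset_history);

    // Drop the oldest hit-set archive objects until only 'max' remain.
    for (unsigned num = hist.history.size(); num > max; --num) {
      list<pg_hit_set_info_t>::iterator p = hist.history.begin();
      hobject_t oid = get_hit_set_archive_object(p->begin, p->end, p->using_gmt);

      // Look up the object context of the archive object being trimmed.
      ObjectContextRef obc = get_object_context(oid, false /* can_create */);
      assert(obc);  // <-- PrimaryLogPG.cc:12985: FAILED assert(obc)

      // ... update stats and remove the history entry ...
      hist.history.pop_front();
    }
  }

If that reading is correct, it would explain why the OSDs keep aborting as soon as they start handling ops for this PG again, but I may well be missing something.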

We are currently at a loss as to how to get these OSDs back up. Any suggestions on how to approach this would be very welcome. If any further information or additional context is needed, please let me know.

Thanks,

Lawrence


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



