Hi all,
We are running a Luminous 12.2.8 cluster with a 3-fold replicated cache
pool with a min_size of 2. We recently encountered an "object unfound"
error in one of the pgs in this pool. After marking this object lost,
two of the acting osds crashed and were unable to start up again, with
only the primary osd staying up. Hoping the cluster would remap the
copies of this pg, we marked the two crashed osds out. Now the primary
osd of this pg has also gone down, again leaving only one active osd,
and the cluster is reporting a degraded filesystem. All of the affected
osds run filestore, while about half of the cluster has already been
upgraded to bluestore.
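For reference, the two steps above were done with the standard commands,
roughly as follows (the pg id, osd ids, and the revert/delete choice are
shown as placeholders here rather than our exact invocation):

  ceph pg <pgid> mark_unfound_lost revert|delete
  ceph osd out <osd-id>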
All three of the crashed osds fail to restart, reporting the following
error during startup:
Oct 12 13:19:12 kaa-109 ceph-osd[166266]: 0> 2018-10-12 13:19:12.782652 7f1f2d79b700 -1
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)'
thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 12985: FAILED assert(obc)
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x109) [0x562265bfda9c]
2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942) [0x5622657d6cea]
3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a) [0x562265800c64]
5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396) [0x562265655cf8]
7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x5622658c09a6]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd) [0x562265c026f5]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
11: (()+0x751e) [0x7f1f4fd7f51e]
12: (clone()+0x3f) [0x7f1f4ed7ef0f]
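As far as I can tell, the failing assert is the object-context lookup for
the hit-set archive object that hit_set_trim() is removing. Roughly
paraphrased from the Luminous PrimaryLogPG.cc (not a verbatim quote, so
names and line numbers may differ slightly):

  // Paraphrase of the trimming loop in PrimaryLogPG::hit_set_trim();
  // shown only to indicate where the assert fires.
  hobject_t oid = get_hit_set_archive_object(p->begin, p->end, p->using_gmt);
  ctx->op_t->remove(oid);
  updated_hit_set_hist.history.pop_front();

  // The object context of the archive object being trimmed is looked up;
  // if that object cannot be found, obc is null and the assert aborts the osd.
  ObjectContextRef obc = get_object_context(oid, false);
  assert(obc);   // PrimaryLogPG.cc:12985: FAILED assert(obc)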
A search of the bug tracker revealed that a similar error was resolved
for Jewel (http://tracker.ceph.com/issues/19223), but I don't know
whether it is relevant here.
We are currently at a loss as to how to get these osds back up. Any
suggestions on how to approach this would be very welcome. If any
further information or additional context is needed, please let me know.
Thanks,
Lawrence