It would be helpful to have a full crash log with debug osd = 0/20, plus
the information in which pool and PG you marked the object as lost.

You might be able to use ceph-objectstore-tool to remove the bad object
from the OSD if it still exists in either the cache pool or the
underlying pool.

Ugly fix if that doesn't work: patch the code to just ignore the missing
object instead of asserting. Only do that after verifying that it's
actually crashing on the object you deleted; ideally you'd also check the
object name and only skip the known bad object. No guarantee that this is
safe to do, but after a very short look at the crashing code I think it
should be.

I also once had to hardcode the name of a badly corrupted object into an
OSD to make it ignore it, to prevent a crash of the OSD holding the last
surviving copy of a PG, fun times... (ceph-objectstore-tool wouldn't even
recognize that an object with that name existed in my case.)

Rough, untested sketches of all three follow below, before the quoted
message.
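For the log: since the OSDs crash during startup, a sketch of what I mean
(section name and paths are the usual defaults, adjust to your setup) is
to set this in ceph.conf on the affected host and start the OSD once; the
crash will then dump the in-memory level-20 log, by default into
/var/log/ceph/ceph-osd.<id>.log:

  [osd]
      debug osd = 0/20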
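For ceph-objectstore-tool, something along these lines (untested here;
the OSD has to be stopped, and the paths, pgid and object spec are
placeholders for your environment):

  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
      --pgid <pgid> --op list
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
      --journal-path /var/lib/ceph/osd/ceph-<id>/journal \
      --pgid <pgid> '<object json from the list output>' remove

The first command lists the objects of that PG as JSON; pick the entry
for the bad object and pass it verbatim to the second command.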
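And for the "ugly fix": an untested sketch of the idea only, not the real
source; it would go around the failing assert(obc) in
PrimaryLogPG::hit_set_trim() (PrimaryLogPG.cc:12985 in your trace), and
"hit_set_<known-bad-name>" is a placeholder for the exact name of the
object you marked lost:

  // Untested sketch: tolerate exactly one known-bad hit set archive
  // object instead of asserting, so the OSD can finish starting.
  ObjectContextRef obc = get_object_context(oid, false);
  if (!obc && oid.oid.name == "hit_set_<known-bad-name>") {
    derr << "hit_set_trim: no obc for known-bad object " << oid
         << ", skipping instead of asserting" << dendl;
    return;  // leaves the rest of the hit set history untrimmed for now
  }
  assert(obc);

Bailing out of hit_set_trim() early just leaves the remaining history
untrimmed, which should be far less bad than the OSD refusing to start,
but again: no guarantees.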
Paul

On Fri, 12 Oct 2018 at 15:34, Lawrence Smith
<lawrence.smith@xxxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> we are running a luminous 12.2.8 cluster with a 3 fold replicated cache
> pool with a min_size of 2. We recently encountered an "object unfound"
> error in one of our pgs in this pool. After marking this object lost,
> two of the acting osds crashed and were unable to start up again, with
> only the primary osd staying up. Hoping the cluster might remap the
> copies of this pg, we marked the two crashed osds as out. Now the
> primary osd of this pg has also gone down leaving again only one active
> osd with the cluster reporting a degraded filesystem. All the affected
> osds are running filestore, while about half the cluster has already
> been upgraded to run bluestore osds.
>
> All three of the crashed osds fail to restart, reporting the following
> error during startup:
>
> Oct 12 13:19:12 kaa-109 ceph-osd[166266]: 0> 2018-10-12
> 13:19:12.782652 7f1f2d79b700 -1
> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
> In function '
> void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned
> int)' thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
> /var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
> 12985: FAILED assert(obc)
>
> ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x109) [0x562265bfda9c]
> 2:
> (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext,
> std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942)
> [0x5622657d6cea]
> 3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
> 4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a)
> [0x562265800c64]
> 5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396)
> [0x562265655cf8]
> 7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
> const&)+0x5a) [0x5622658c09a6]
> 8: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
> 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd)
> [0x562265c026f5]
> 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
> 11: (()+0x751e) [0x7f1f4fd7f51e]
> 12: (clone()+0x3f) [0x7f1f4ed7ef0f]
>
> A search in the bug tracker revealed that a similar error has been
> resolved for jewel http://tracker.ceph.com/issues/19223, yet I don't
> know if this is in any way relevant.
>
> We are currently at a loss how to get these osds back up. Any
> suggestions how to approach this would be very welcome. If there is any
> further information that is needed or additional context please let me know.
>
> Thanks,
>
> Lawrence

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com