Hi all,
We are running a Luminous 12.2.8 cluster with a 3-fold replicated cache
pool with a min_size of 2. We recently encountered an "object unfound"
error in one of the pgs in this pool. After marking this object lost,
two of the acting osds crashed and were unable to start up again, with
only the primary osd staying up. Hoping the cluster would remap the
copies of this pg, we marked the two crashed osds out. Now the primary
osd of this pg has also gone down, again leaving only one active osd,
and the cluster is reporting a degraded filesystem. All of the affected
osds run filestore, while about half of the cluster has already been
upgraded to bluestore.
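For reference, the two steps above were done with the standard commands,
roughly as follows (the pg id, osd ids, and the revert/delete choice are
shown as placeholders here rather than our exact invocation):

  ceph pg <pgid> mark_unfound_lost revert|delete
  ceph osd out <osd-id>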
All three of the crashed osds fail to restart, reporting the following
error during startup:
Oct 12 13:19:12 kaa-109 ceph-osd[166266]: 0> 2018-10-12 13:19:12.782652 7f1f2d79b700 -1
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc:
In function 'void PrimaryLogPG::hit_set_trim(PrimaryLogPG::OpContextUPtr&, unsigned int)'
thread 7f1f2d79b700 time 2018-10-12 13:19:12.779813
/var/tmp/portage/sys-cluster/ceph-12.2.8/work/ceph-12.2.8/src/osd/PrimaryLogPG.cc: 12985: FAILED assert(obc)
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x109) [0x562265bfda9c]
2: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x942) [0x5622657d6cea]
3: (PrimaryLogPG::hit_set_persist()+0xa4b) [0x5622657e5fab]
4: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x426a) [0x562265800c64]
5: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc1f) [0x5622657b94ed]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x396) [0x562265655cf8]
7: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x5622658c09a6]
8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1ab6) [0x562265657918]
9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5cd) [0x562265c026f5]
10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562265c05e72]
11: (()+0x751e) [0x7f1f4fd7f51e]
12: (clone()+0x3f) [0x7f1f4ed7ef0f]
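As far as I can tell, the failing assert is the object-context lookup for
the hit-set archive object that hit_set_trim() is removing. Roughly
paraphrased from the Luminous PrimaryLogPG.cc (not a verbatim quote, so
names and line numbers may differ slightly):

  // Paraphrase of the trimming loop in PrimaryLogPG::hit_set_trim();
  // shown only to indicate where the assert fires.
  hobject_t oid = get_hit_set_archive_object(p->begin, p->end, p->using_gmt);
  ctx->op_t->remove(oid);
  updated_hit_set_hist.history.pop_front();

  // The object context of the archive object being trimmed is looked up;
  // if that object cannot be found, obc is null and the assert aborts the osd.
  ObjectContextRef obc = get_object_context(oid, false);
  assert(obc);   // PrimaryLogPG.cc:12985: FAILED assert(obc)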
A search of the bug tracker revealed that a similar error was resolved
for Jewel (http://tracker.ceph.com/issues/19223), but I don't know
whether it is relevant here.
We are currently at a loss as to how to get these osds back up. Any
suggestions on how to approach this would be very welcome. If any
further information or additional context is needed, please let me know.
Thanks,
Lawrence