Hi all,
yesterday one of my OSDs went down with the following error:
2018-01-04 06:47:25.304513 7fe6eda51700 -1 log_channel(cluster) log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects
2018-01-04 06:47:25.312861 7fe6eda51700 -1 log_channel(cluster) log [ERR] : 6.20 repair 3 errors, 2 fixed
2018-01-04 06:47:26.796659 7fe6eda51700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fe6eda51700 time 2018-01-04 06:47:26.649174
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x562994121de2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x11f0) [0x562993ccec10]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x788) [0x562993e4bb98]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x562993e4db36]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x214) [0x562993e50c04]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x562993d75ec0]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x77b) [0x562993ce265b]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7) [0x562993b749e7]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x562993de6ad7]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x562993ba121c]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x562994127a6d]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562994129a30]
13: (()+0x7494) [0x7fe706e36494]
14: (clone()+0x3f) [0x7fe705ebdaff]
I was unable to start that OSD right away, but after a few hours a restart succeeded and the OSD came back up (I don't know why restarting it immediately after the failure was not enough, but later it worked).
What does this assert mean, and how can I prevent it?
A deep scrub after that showed that an object is missing on 2 OSDs. It seems to me that this object belongs to a deleted snapshot ("snap": 37, see below). Maybe something went wrong during snaptrim, the object was not removed from one OSD, and now the PG shows up as inconsistent.
How can I verify that it is safe to delete this object, i.e. that it is not part of any existing RBD snapshot?
Below is the output of rados list-inconsistent-obj --format=json-pretty 6.20:
{
    "epoch": 7240,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.967992ae8944a.000000000006b41f",
                "nspace": "",
                "locator": "",
                "snap": 37,
                "version": 649618
            },
            "errors": [],
            "union_shard_errors": [
                "missing"
            ],
            "selected_object_info": "6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25(6240'452090 osd.1.0:251266 dirty|data_digest|omap_digest s 4194304 uv 649618 dd e0468a41 od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 1,
                    "primary": true,
                    "errors": [
                        "missing"
                    ]
                },
                {
                    "osd": 10,
                    "primary": false,
                    "errors": [
                        "missing"
                    ]
                },
                {
                    "osd": 14,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe0468a41"
                }
            ]
        }
    ]
}
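
For reference, this is roughly what I was planning to check before deleting anything (not sure it is the right or complete approach, so corrections are welcome): find the RBD image that owns the rbd_data.967992ae8944a prefix, see whether snap id 37 still exists for that image, and list the object's clones directly. The pool name "rbd" below is just a placeholder for my actual pool.

# find the image whose block_name_prefix is rbd_data.967992ae8944a
for img in $(rbd -p rbd ls); do
    rbd -p rbd info "$img" | grep -q 967992ae8944a && echo "$img"
done

# list that image's snapshots; the SNAPID column should show whether 37 still exists
rbd -p rbd snap ls <image-found-above>

# list the clones of the inconsistent object itself
rados -p rbd listsnaps rbd_data.967992ae8944a.000000000006b41f

My assumption is that if snap id 37 no longer appears for the image, the clone left on osd.14 is just leftover from snaptrim, but I would like confirmation before touching it.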