Hi all,
yesterday one of my OSDs went down with the following error:
2018-01-04 06:47:25.304513 7fe6eda51700 -1 log_channel(cluster) log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects
2018-01-04 06:47:25.312861 7fe6eda51700 -1 log_channel(cluster) log [ERR] : 6.20 repair 3 errors, 2 fixed
2018-01-04 06:47:26.796659 7fe6eda51700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fe6eda51700 time 2018-01-04 06:47:26.649174
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())
ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x562994121de2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x11f0) [0x562993ccec10]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x788) [0x562993e4bb98]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x562993e4db36]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x214) [0x562993e50c04]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x562993d75ec0]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x77b) [0x562993ce265b]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7) [0x562993b749e7]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x562993de6ad7]
10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x562993ba121c]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x562994127a6d]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x562994129a30]
13: (()+0x7494) [0x7fe706e36494]
14: (clone()+0x3f) [0x7fe705ebdaff]
I was unable to start that OSD right away, but after a few hours a restart succeeded and the OSD came back up (I don't know why restarting it immediately after the failure was not enough, but later it worked).
What does this assert mean, and how can I prevent it?
A deep scrub after that showed that an object is missing on 2 OSDs. It seems to me that this object belongs to a deleted snapshot ("snap": 37, see below). Maybe something went wrong during snaptrim, the object was not removed from one OSD, and now the PG shows up as inconsistent.
How can I verify that it is safe to delete this object, i.e. that it is not part of any existing RBD snapshot?
Below is the output of rados list-inconsistent-obj --format=json-pretty 6.20:
{
    "epoch": 7240,
    "inconsistents": [
        {
            "object": {
                "name": "rbd_data.967992ae8944a.000000000006b41f",
                "nspace": "",
                "locator": "",
                "snap": 37,
                "version": 649618
            },
            "errors": [],
            "union_shard_errors": [
                "missing"
            ],
            "selected_object_info": "6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25(6240'452090 osd.1.0:251266 dirty|data_digest|omap_digest s 4194304 uv 649618 dd e0468a41 od ffffffff alloc_hint [0 0 0])",
            "shards": [
                {
                    "osd": 1,
                    "primary": true,
                    "errors": [
                        "missing"
                    ]
                },
                {
                    "osd": 10,
                    "primary": false,
                    "errors": [
                        "missing"
                    ]
                },
                {
                    "osd": 14,
                    "primary": false,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0xe0468a41"
                }
            ]
        }
    ]
}
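
For reference, this is roughly what I was planning to check before deleting anything (not sure it is the right or complete approach, so corrections are welcome): find the RBD image that owns the rbd_data.967992ae8944a prefix, see whether snap id 37 still exists for that image, and list the object's clones directly. The pool name "rbd" below is just a placeholder for my actual pool.

# find the image whose block_name_prefix is rbd_data.967992ae8944a
for img in $(rbd -p rbd ls); do
    rbd -p rbd info "$img" | grep -q 967992ae8944a && echo "$img"
done

# list that image's snapshots; the SNAPID column should show whether 37 still exists
rbd -p rbd snap ls <image-found-above>

# list the clones of the inconsistent object itself
rados -p rbd listsnaps rbd_data.967992ae8944a.000000000006b41f

My assumption is that if snap id 37 no longer appears for the image, the clone left on osd.14 is just leftover from snaptrim, but I would like confirmation before touching it.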