OSD crash during pg repair - recovery_info.ss.clone_snaps.end and other problems

Hi all,

I have a few problems on my cluster that may be linked together, and they have now caused an OSD to go down during a pg repair.

First, a few notes about my cluster:

4 nodes, 15 OSDs, installed on Luminous (no upgrade from an earlier release).
Replicated pools, with one pool (pool 6) cached by SSD disks.
I haven't detected any hardware failures (disk I/O errors, restarts, corrupted data, etc.). I'm running RBDs via libvirt on Debian wheezy and jessie (stable and oldstable).
I'm snapshotting the RBDs using a Luminous client, on Debian jessie only.

Now the problems, from mild to severe:

1)
Almost every day I notice some health problems after a deep scrub:
1-2 inconsistent PGs with "read_error" on some OSDs.
When I don't repair them, they disappear after a few days (maybe after another deep scrub?). There are no read errors on the disks themselves (disk checks are OK, nothing logged in syslog).
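
For reference, this is roughly what I run when I do repair them (pg 6.20 is just an example id, and I'm quoting the commands from memory, so the exact options may differ):

  ceph health detail
  rados list-inconsistent-obj 6.20 --format=json-pretty
  ceph pg repair 6.20

The list-inconsistent-obj output is where the "read_error" on one of the OSDs shows up.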

2)
I noticed on my pool 6 (the cached pool) that scrub reports some objects that shouldn't be there:

2018-02-27 23:43:06.490152 7f4b3820e700 -1 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771984 (8712'770478,8712'771984] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771984 lcod 8712'771983 mlcod 8712'771983 active+clean+scrubbing+deep+inconsistent+repair] _scan_snaps no head for 6:07ffbc7b:::rbd_data.967992ae8944a.0000000000061cb8:c2 (have MIN)

I think that means an orphaned snap (clone) object without its head object. Maybe snaptrim left it there? Why? Maybe an error during snaptrim? Or did fstrim/discard remove the "head" object (I hope that's nonsense)?
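
In case it helps, I think the clone/snapset state of such an object can be inspected with something like the following (just an illustration, using the object from the log above; I'm not certain these commands behave the same on a cache-tiered pool):

  rados list-inconsistent-snapset 6.20 --format=json-pretty
  rados -p <cache-pool-name> listsnaps rbd_data.967992ae8944a.0000000000061cb8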

3)
I ended up with one object (probably a snap object) that has only 1 replica (out of size 3), and when I try to repair it, my OSD crashes with:

/build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

I guess it detected the orphaned snap object I mentioned in 2) and, instead of repairing it, just asserts and stops the OSD. Am I right?

I noticed the comment "// hmm, should we warn?" in the Ceph source next to that assert. So should someone remove that assert?

And my questions are:

How can I fix the issue with the crashing OSD?
How can I safely remove those objects with a missing head? Is there any tool for that, or a way to force snaptrim on the non-existent snapshots? This is a production cluster, so I want to be careful; I have no problems with data availability right now. My last idea is to move the RBDs to another pool, but I don't have enough space for that (as far as I know an RBD image can only be copied, not moved), so I'm looking for another clean solution.
And the last question: how can I find out what is causing those read_errors and snap object leftovers?
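
As for removing the leftover clones myself, the only tool I could think of is ceph-objectstore-tool with the OSD stopped, roughly like this (I have NOT tried it, this is only my guess at the procedure, so please correct me):

  systemctl stop ceph-osd@1
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 6.20 --op list rbd_data.967992ae8944a.0000000000061cb8
  # remove the leftover clone using the JSON object spec printed by the previous command
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 6.20 '<json object spec from the list output>' remove
  systemctl start ceph-osd@1

Would something like that (repeated on every OSD that still has the clone) be safe, or is there a better way?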

Should I paste my whole log? It is bigger than the allowed post size, so I'm pasting only the most important events:

   -23> 2018-02-27 23:43:07.903368 7f4b3820e700  2 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+clean+scrubbing+deep+inconsistent+repair] 6.20 repair 1 missing, 0 inconsistent objects
   -22> 2018-02-27 23:43:07.903410 7f4b3820e700 -1 log_channel(cluster) log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects
   -21> 2018-02-27 23:43:07.903446 7f4b3820e700 -1 log_channel(cluster) log [ERR] : 6.20 repair 3 errors, 2 fixed
   -20> 2018-02-27 23:43:07.903480 7f4b3820e700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0
   -19> 2018-02-27 23:43:07.903604 7f4b3820e700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c021]:6803/3001666 -- MOSDScrubReserve(6.20 RELEASE e8712) v1 -- 0x55a4c5459c00 con 0
   -18> 2018-02-27 23:43:07.903651 7f4b3820e700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c041]:6802/3012729 -- MOSDScrubReserve(6.20 RELEASE e8712) v1 -- 0x55a4cb6dee00 con 0
   -17> 2018-02-27 23:43:07.903679 7f4b3820e700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c021]:6803/3001666 -- pg_info((query:8712 sent:8712 6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710))=(empty) epoch 8712) v5 -- 0x55a4cba734a0 con 0
   -16> 2018-02-27 23:43:07.903704 7f4b3820e700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c041]:6802/3012729 -- pg_info((query:8712 sent:8712 6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710))=(empty) epoch 8712) v5 -- 0x55a4cccb5680 con 0
   -15> 2018-02-27 23:43:07.903746 7f4b39a11700  5 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+degraded+inconsistent m=1] exit Started/Primary/Active/Clean 7633.321426 4 0.000185
   -14> 2018-02-27 23:43:07.903776 7f4b39a11700  5 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+degraded+inconsistent m=1] enter Started/Primary/Active/WaitLocalRecoveryReserved
   -13> 2018-02-27 23:43:07.903953 7f4b3a212700  5 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] exit Started/Primary/Active/WaitLocalRecoveryReserved 0.000177 1 0.000072
   -12> 2018-02-27 23:43:07.904009 7f4b3a212700  5 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] enter Started/Primary/Active/WaitRemoteRecoveryReserved
   -11> 2018-02-27 23:43:07.904038 7f4b3a212700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c021]:6803/3001666 -- MRecoveryReserve(6.20 REQUEST e8712) v2 -- 0x55a4c5082780 con 0
   -10> 2018-02-27 23:43:07.983199 7f4b4e256700  5 -- [2a01:430:22a::cef:c011]:6805/514544 >> [2a01:430:22a::cef:c021]:6803/3001666 conn(0x55a4c4f87800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2215 cs=1 l=0). rx osd.10 seq 6199 0x55a4c5082780 MRecoveryReserve(6.20 GRANT e8712) v2
    -9> 2018-02-27 23:43:07.983275 7f4b43ac2700  1 -- [2a01:430:22a::cef:c011]:6805/514544 <== osd.10 [2a01:430:22a::cef:c021]:6803/3001666 6199 ==== MRecoveryReserve(6.20 GRANT e8712) v2 ==== 26+0+0 (3843789212 0 0) 0x55a4c5082780 con 0x55a4c4f87800
    -8> 2018-02-27 23:43:07.983411 7f4b39a11700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c041]:6802/3012729 -- MRecoveryReserve(6.20 REQUEST e8712) v2 -- 0x55a4c5040780 con 0
    -7> 2018-02-27 23:43:07.984468 7f4b4da55700  5 -- [2a01:430:22a::cef:c011]:6805/514544 >> [2a01:430:22a::cef:c041]:6802/3012729 conn(0x55a4c4f89000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1197 cs=1 l=0). rx osd.14 seq 6198 0x55a4c5040780 MRecoveryReserve(6.20 GRANT e8712) v2
    -6> 2018-02-27 23:43:07.984527 7f4b43ac2700  1 -- [2a01:430:22a::cef:c011]:6805/514544 <== osd.14 [2a01:430:22a::cef:c041]:6802/3012729 6198 ==== MRecoveryReserve(6.20 GRANT e8712) v2 ==== 26+0+0 (3843789212 0 0) 0x55a4c5040780 con 0x55a4c4f89000
    -5> 2018-02-27 23:43:07.984642 7f4b3a212700  5 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] exit Started/Primary/Active/WaitRemoteRecoveryReserved 0.080633 2 0.000176
    -4> 2018-02-27 23:43:07.984705 7f4b3a212700  5 osd.1 pg_epoch: 8712 pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711 n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807 8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985 mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] enter Started/Primary/Active/Recovering
    -3> 2018-02-27 23:43:08.087536 7f4b3820e700  1 -- [2a01:430:22a::cef:c011]:6805/514544 --> [2a01:430:22a::cef:c041]:6802/3012729 -- MOSDPGPull(6.20 e8712/8710 cost 8389608) v3 -- 0x55a4c7281600 con 0
    -2> 2018-02-27 23:43:08.281205 7f4b4da55700  5 -- [2a01:430:22a::cef:c011]:6805/514544 >> [2a01:430:22a::cef:c041]:6802/3012729 conn(0x55a4c4f89000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1197 cs=1 l=0). rx osd.14 seq 6199 0x55a4c7281600 MOSDPGPush(6.20 8712/8710 [PushOp(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25, version: 6240'452090, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25@6240'452090, size: 0, copy_subset: [], clone_subset: {6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:d8=[]}, snapset: d8=[d8,d6,d4,d2,d0,ce,cb,ca,c8,c6,c4,c2,c0,be]:{be=[be],c0=[c0],c2=[c2],c6=[c6,c4],c8=[c8],ca=[ca],cb=[cb],ce=[ce],d0=[d0],d2=[d2],d4=[d4],d8=[d8,d6]}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
    -1> 2018-02-27 23:43:08.281240 7f4b4da55700  1 -- [2a01:430:22a::cef:c011]:6805/514544 <== osd.14 [2a01:430:22a::cef:c041]:6802/3012729 6199 ==== MOSDPGPush(6.20 8712/8710 [PushOp(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25, version: 6240'452090, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 1, recovery_info: ObjectRecoveryInfo(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25@6240'452090, size: 0, copy_subset: [], clone_subset: {6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:d8=[]}, snapset: d8=[d8,d6,d4,d2,d0,ce,cb,ca,c8,c6,c4,c2,c0,be]:{be=[be],c0=[c0],c2=[c2],c6=[c6,c4],c8=[c8],ca=[ca],cb=[cb],ce=[ce],d0=[d0],d2=[d2],d4=[d4],d8=[d8,d6]}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true, error:false), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3 ==== 2020+0+0 (3173406802 0 0) 0x55a4c7281600 con 0x55a4c4f89000
     0> 2018-02-27 23:43:08.287138 7f4b3820e700 -1 /build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f4b3820e700 time 2018-02-27 23:43:08.281422
/build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

 ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55a4b78ad9b2]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x11f0) [0x55a4b74581b0]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x788) [0x55a4b75d5ff8]
 4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2a6) [0x55a4b75d7fc6]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x224) [0x55a4b75db204]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55a4b7500690]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5f1) [0x55a4b746bf31]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7) [0x55a4b72f42d7]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x55a4b7571137]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x108c) [0x55a4b73245ec]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d) [0x55a4b78b363d]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55a4b78b5600]
 13: (()+0x7494) [0x7f4b515f6494]
 14: (clone()+0x3f) [0x7f4b5067daff]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thank you very much

With regards
Jan Pekar

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



