My questions are:
How can I fix the issue with the crashing OSD?
How can I safely remove those objects with the missing head? Is there any
tool for that, or a way to force a snaptrim on non-existent snapshots? It is a
production cluster, so I want to be careful. Right now I have no problems
with data availability.
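For reference, this is roughly how I was planning to at least inspect the
damaged objects before deleting anything (the pool name "rbd" below is only a
placeholder for my real pool, and I have not run any removal yet), so please
correct me if this is the wrong direction:

  # show exactly what the scrub found inconsistent in the PG
  rados list-inconsistent-obj 6.20 --format=json-pretty

  # list which clones/snapshots rados still knows about for the affected object
  rados -p rbd listsnaps rbd_data.967992ae8944a.000000000006b41f

  # with the OSD stopped, look at the object directly in the OSD store;
  # I assume ceph-objectstore-tool could also remove the leftover clone,
  # but I'm not sure that is safe here
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
      --pgid 6.20 --op list rbd_data.967992ae8944a.000000000006b41f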
My last idea is to move the RBD images to another pool, but I don't have
enough space to do that (as far as I know, RBD images can only be copied, not
moved), so I'm looking for another clean solution.
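Just for completeness, the copy-based workaround I was considering (but cannot
do for space reasons) would look roughly like this; pool and image names are
placeholders, and as far as I understand "rbd cp" copies only the current image
data, not its snapshots:

  # copy the image into a new, clean pool
  rbd cp oldpool/image1 newpool/image1

  # once verified, drop the snapshots and the original image
  rbd snap purge oldpool/image1
  rbd rm oldpool/image1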
And my last question: how can I find out what is causing those read errors
and the leftover snapshot objects?
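The only things I could think of checking myself so far are the disk and
kernel side, something like this (device name and log path are placeholders
for my setup):

  # check the physical disk behind osd.1 for pending/reallocated sectors
  smartctl -a /dev/sdX

  # look for low-level I/O errors reported by the kernel
  dmesg -T | grep -iE 'i/o error|sector'

  # grep the OSD log for the scrub read error lines
  grep -i 'read error' /var/log/ceph/ceph-osd.1.log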
Should I paste my whole log? It is bigger than the allowed post size, so I'm
pasting only the most important events:
-23> 2018-02-27 23:43:07.903368 7f4b3820e700 2 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+clean+scrubbing+deep+inconsistent+repair] 6.20
repair 1 missing, 0 inconsistent objects
-22> 2018-02-27 23:43:07.903410 7f4b3820e700 -1 log_channel(cluster)
log [ERR] : 6.20 repair 1 missing, 0 inconsistent objects
-21> 2018-02-27 23:43:07.903446 7f4b3820e700 -1 log_channel(cluster)
log [ERR] : 6.20 repair 3 errors, 2 fixed
-20> 2018-02-27 23:43:07.903480 7f4b3820e700 5
write_log_and_missing with: dirty_to: 0'0, dirty_from:
4294967295'18446744073709551615, writeout_from:
4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
clear_divergent_priors: 0
-19> 2018-02-27 23:43:07.903604 7f4b3820e700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c021]:6803/3001666 -- MOSDScrubReserve(6.20 RELEASE
e8712) v1 -- 0x55a4c5459c00 con 0
-18> 2018-02-27 23:43:07.903651 7f4b3820e700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c041]:6802/3012729 -- MOSDScrubReserve(6.20 RELEASE
e8712) v1 -- 0x55a4cb6dee00 con 0
-17> 2018-02-27 23:43:07.903679 7f4b3820e700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c021]:6803/3001666 -- pg_info((query:8712 sent:8712
6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710))=(empty) epoch 8712) v5 -- 0x55a4cba734a0 con 0
-16> 2018-02-27 23:43:07.903704 7f4b3820e700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c041]:6802/3012729 -- pg_info((query:8712 sent:8712
6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710))=(empty) epoch 8712) v5 -- 0x55a4cccb5680 con 0
-15> 2018-02-27 23:43:07.903746 7f4b39a11700 5 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+degraded+inconsistent m=1] exit
Started/Primary/Active/Clean 7633.321426 4 0.000185
-14> 2018-02-27 23:43:07.903776 7f4b39a11700 5 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+degraded+inconsistent m=1] enter
Started/Primary/Active/WaitLocalRecoveryReserved
-13> 2018-02-27 23:43:07.903953 7f4b3a212700 5 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] exit
Started/Primary/Active/WaitLocalRecoveryReserved 0.000177 1 0.000072
-12> 2018-02-27 23:43:07.904009 7f4b3a212700 5 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] enter
Started/Primary/Active/WaitRemoteRecoveryReserved
-11> 2018-02-27 23:43:07.904038 7f4b3a212700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c021]:6803/3001666 -- MRecoveryReserve(6.20 REQUEST
e8712) v2 -- 0x55a4c5082780 con 0
-10> 2018-02-27 23:43:07.983199 7f4b4e256700 5 --
[2a01:430:22a::cef:c011]:6805/514544 >>
[2a01:430:22a::cef:c021]:6803/3001666 conn(0x55a4c4f87800 :-1
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=2215 cs=1 l=0). rx
osd.10 seq 6199 0x55a4c5082780 MRecoveryReserve(6.20 GRANT e8712) v2
-9> 2018-02-27 23:43:07.983275 7f4b43ac2700 1 --
[2a01:430:22a::cef:c011]:6805/514544 <== osd.10
[2a01:430:22a::cef:c021]:6803/3001666 6199 ==== MRecoveryReserve(6.20
GRANT e8712) v2 ==== 26+0+0 (3843789212 0 0) 0x55a4c5082780 con
0x55a4c4f87800
-8> 2018-02-27 23:43:07.983411 7f4b39a11700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c041]:6802/3012729 -- MRecoveryReserve(6.20 REQUEST
e8712) v2 -- 0x55a4c5040780 con 0
-7> 2018-02-27 23:43:07.984468 7f4b4da55700 5 --
[2a01:430:22a::cef:c011]:6805/514544 >>
[2a01:430:22a::cef:c041]:6802/3012729 conn(0x55a4c4f89000 :-1
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1197 cs=1 l=0). rx
osd.14 seq 6198 0x55a4c5040780 MRecoveryReserve(6.20 GRANT e8712) v2
-6> 2018-02-27 23:43:07.984527 7f4b43ac2700 1 --
[2a01:430:22a::cef:c011]:6805/514544 <== osd.14
[2a01:430:22a::cef:c041]:6802/3012729 6198 ==== MRecoveryReserve(6.20
GRANT e8712) v2 ==== 26+0+0 (3843789212 0 0) 0x55a4c5040780 con
0x55a4c4f89000
-5> 2018-02-27 23:43:07.984642 7f4b3a212700 5 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] exit
Started/Primary/Active/WaitRemoteRecoveryReserved 0.080633 2 0.000176
-4> 2018-02-27 23:43:07.984705 7f4b3a212700 5 osd.1 pg_epoch: 8712
pg[6.20( v 8712'771986 (8712'770478,8712'771986] local-lis/les=8710/8711
n=14299 ec=4197/2380 lis/c 8710/8710 les/c/f 8711/8711/2807
8710/8710/8710) [1,10,14] r=0 lpr=8710 crt=8712'771986 lcod 8712'771985
mlcod 8712'771985 active+recovery_wait+degraded+inconsistent m=1] enter
Started/Primary/Active/Recovering
-3> 2018-02-27 23:43:08.087536 7f4b3820e700 1 --
[2a01:430:22a::cef:c011]:6805/514544 -->
[2a01:430:22a::cef:c041]:6802/3012729 -- MOSDPGPull(6.20 e8712/8710 cost
8389608) v3 -- 0x55a4c7281600 con 0
-2> 2018-02-27 23:43:08.281205 7f4b4da55700 5 --
[2a01:430:22a::cef:c011]:6805/514544 >>
[2a01:430:22a::cef:c041]:6802/3012729 conn(0x55a4c4f89000 :-1
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=1197 cs=1 l=0). rx
osd.14 seq 6199 0x55a4c7281600 MOSDPGPush(6.20 8712/8710
[PushOp(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25,
version: 6240'452090, data_included: [], data_size: 0, omap_header_size:
0, omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25@6240'452090,
size: 0, copy_subset: [], clone_subset:
{6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:d8=[]}, snapset:
d8=[d8,d6,d4,d2,d0,ce,cb,ca,c8,c6,c4,c2,c0,be]:{be=[be],c0=[c0],c2=[c2],c6=[c6,c4],c8=[c8],ca=[ca],cb=[cb],ce=[ce],d0=[d0],d2=[d2],d4=[d4],d8=[d8,d6]}),
after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true,
error:false), before_progress: ObjectRecoveryProgress(first,
data_recovered_to:0, data_complete:false, omap_recovered_to:,
omap_complete:false, error:false))]) v3
-1> 2018-02-27 23:43:08.281240 7f4b4da55700 1 --
[2a01:430:22a::cef:c011]:6805/514544 <== osd.14
[2a01:430:22a::cef:c041]:6802/3012729 6199 ==== MOSDPGPush(6.20 8712/8710
[PushOp(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25,
version: 6240'452090, data_included: [], data_size: 0, omap_header_size:
0, omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:25@6240'452090,
size: 0, copy_subset: [], clone_subset:
{6:0663e376:::rbd_data.967992ae8944a.000000000006b41f:d8=[]}, snapset:
d8=[d8,d6,d4,d2,d0,ce,cb,ca,c8,c6,c4,c2,c0,be]:{be=[be],c0=[c0],c2=[c2],c6=[c6,c4],c8=[c8],ca=[ca],cb=[cb],ce=[ce],d0=[d0],d2=[d2],d4=[d4],d8=[d8,d6]}),
after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true,
error:false), before_progress: ObjectRecoveryProgress(first,
data_recovered_to:0, data_complete:false, omap_recovered_to:,
omap_complete:false, error:false))]) v3 ==== 2020+0+0 (3173406802 0 0)
0x55a4c7281600 con 0x55a4c4f89000
0> 2018-02-27 23:43:08.287138 7f4b3820e700 -1
/build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const
ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)'
thread 7f4b3820e700 time 2018-02-27 23:43:08.281422
/build/ceph-12.2.3/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p !=
recovery_info.ss.clone_snaps.end())
ceph version 12.2.3 (2dab17a455c09584f2a85e6b10888337d1ec8949)
luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55a4b78ad9b2]
2: (PrimaryLogPG::on_local_recover(hobject_t const&,
ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool,
ObjectStore::Transaction*)+0x11f0) [0x55a4b74581b0]
3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp
const&,
PullOp*, std::list<ReplicatedBackend::pull_complete_info,
std::allocator<ReplicatedBackend::pull_complete_info> >*,
ObjectStore::Transaction*)+0x788) [0x55a4b75d5ff8]
4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2a6)
[0x55a4b75d7fc6]
5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x224)
[0x55a4b75db204]
6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
[0x55a4b7500690]
7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x5f1) [0x55a4b746bf31]
8: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f7)
[0x55a4b72f42d7]
9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x57) [0x55a4b7571137]
10: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x108c) [0x55a4b73245ec]
11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x88d)
[0x55a4b78b363d]
12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
[0x55a4b78b5600]
13: (()+0x7494) [0x7f4b515f6494]
14: (clone()+0x3f) [0x7f4b5067daff]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
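If I read the snapset in the MOSDPGPush above correctly, the clone being
recovered is snap 0x25 of rbd_data.967992ae8944a.000000000006b41f, while the
snapset only lists clones be through d8, which seems to be exactly the
condition the assert at PrimaryLogPG.cc:358 checks for. To map that object
prefix back to an RBD image and see whether snap id 0x25 (37 decimal) still
exists, I was going to do something like this (the pool name "rbd" is again
just a placeholder):

  # find which image uses the block_name_prefix rbd_data.967992ae8944a
  for img in $(rbd ls rbd); do
      rbd info rbd/"$img" | grep -q 967992ae8944a && echo "$img"
  done

  # then check whether snapshot id 37 (0x25) is still listed for that image
  rbd snap ls rbd/<image>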
Thank you very much.
With regards,
Jan Pekar