Is this planned to be merged into Luminous at some point?
,Ashley
From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx]
Sent: Tuesday, 6 June 2017 2:24 AM
To: Ashley Merrick <ashley@xxxxxxxxxxxxxx>; ceph-users@xxxxxxxx
Cc: David Zafman <dzafman@xxxxxxxxxx>
Subject: Re: [ceph-users] PG Stuck EC Pool
It looks to me like this is related to http://tracker.ceph.com/issues/18162.
You might see if they came up with good resolution steps, and it looks like David is working on it in master but hasn't finished it yet.
Attaching logs with logging set to level 20.
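(For reference, debug logging can be bumped to 20 on a running OSD with injectargs; a minimal sketch using the OSD ids from this thread:)

ceph tell osd.83 injectargs '--debug_osd 20 --debug_ms 1'
ceph tell osd.84 injectargs '--debug_osd 20 --debug_ms 1'
# or set "debug osd = 20" under [osd] in ceph.conf and restart the daemons,
# so the extra logging is already active when the OSD hits the crash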
After repeated attempts at removing nobackfill I have got it down to:
recovery 31892/272325586 objects degraded (0.012%)
recovery 2/272325586 objects misplaced (0.000%)
However, any further attempt after removing nobackfill just causes an instant crash on 83 & 84. At this point I feel there is some corruption on the remaining 11 OSDs of the PG, although the errors aren't directly saying that; the crash always ends with:
-1 *** Caught signal (Aborted) ** in thread 7f716e862700 thread_name:tp_osd_recov
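(For reference, the flag toggling and log watching look roughly like this; the log path is the packaging default:)

ceph osd unset nobackfill                # let backfill start again
ceph -w                                  # watch the cluster log for pg 6.14
tail -f /var/log/ceph/ceph-osd.83.log    # watch for the tp_osd_recov abort
ceph osd set nobackfill                  # re-set the flag once 83/84 crash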
,Ashley
I have now done some further testing and am seeing these errors on 84 / 83, the OSDs that crash while backfilling to 10 and 11:
-60> 2017-06-03 10:08:56.651768 7f6f76714700 1 -- 172.16.3.14:6823/2694 <== osd.3 172.16.2.101:0/25361 10 ==== osd_ping(ping e71688 stamp 2017-06-03 10:08:56.652035) v2 ==== 47+0+0 (1097709006 0 0) 0x5569ea88d400 con 0x5569e900e300
-59> 2017-06-03 10:08:56.651804 7f6f76714700 1 -- 172.16.3.14:6823/2694 --> 172.16.2.101:0/25361 -- osd_ping(ping_reply e71688 stamp 2017-06-03 10:08:56.652035) v2 -- ?+0 0x5569e985fc00 con 0x5569e900e300
-6> 2017-06-03 10:08:56.937156 7f6f5ee4d700 1 -- 172.16.3.14:6822/2694 <== osd.53 172.16.3.7:6816/15230 13 ==== MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0)) v1 ==== 148+0+0 (2355392791 0 0) 0x5569e8b22080 con 0x5569e9538f00
-5> 2017-06-03 10:08:56.937193 7f6f5ee4d700 5 -- op tracker -- seq: 2409, time: 2017-06-03 10:08:56.937193, event: queued_for_pg, op: MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
-4> 2017-06-03 10:08:56.937241 7f6f8ef8a700 5 -- op tracker -- seq: 2409, time: 2017-06-03 10:08:56.937240, event: reached_pg, op: MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
-3> 2017-06-03 10:08:56.937266 7f6f8ef8a700 0 osd.83 pg_epoch: 71688 pg[6.14s3( v 71685'35512 (68694'30812,71685'35512] local-les=71688 n=15928 ec=31534 les/c/f 71688/69510/67943 71687/71687/71687) [11,10,2147483647,83,22,26,69,72,53,59,8,4,46]/[2147483647,2147483647,2147483647,83,22,26,69,72,53,59,8,4,46] r=3 lpr=71687 pi=47065-71686/711 rops=1 bft=10(1),11(0) crt=71629'35509 mlcod 0'0 active+undersized+degraded+remapped+inconsistent+backfilling NIBBLEWISE] failed_push 6:28170432:::rbd_data.e3d8852ae8944a.0000000000047d28:head from shard 53(8), reps on unfound? 0
-2> 2017-06-03 10:08:56.937346 7f6f8ef8a700 5 -- op tracker -- seq: 2409, time: 2017-06-03 10:08:56.937345, event: done, op: MOSDECSubOpReadReply(6.14s3 71688 ECSubReadReply(tid=83, attrs_read=0))
-1> 2017-06-03 10:08:56.937351 7f6f89f80700 -1 osd.83 pg_epoch: 71688 pg[6.14s3( v 71685'35512 (68694'30812,71685'35512] local-les=71688 n=15928 ec=31534 les/c/f 71688/69510/67943 71687/71687/71687) [11,10,2147483647,83,22,26,69,72,53,59,8,4,46]/[2147483647,2147483647,2147483647,83,22,26,69,72,53,59,8,4,46] r=3 lpr=71687 pi=47065-71686/711 bft=10(1),11(0) crt=71629'35509 mlcod 0'0 active+undersized+degraded+remapped+inconsistent+backfilling NIBBLEWISE] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
-42> 2017-06-03 10:08:56.968433 7f6f5f04f700 1 -- 172.16.2.114:6822/2694 <== client.22857445 172.16.2.212:0/2238053329 56 ==== osd_op(client.22857445.1:759236283 2.e732321d rbd_data.61b4c6238e1f29.000000000001ea27 [set-alloc-hint object_size 4194304 write_size 4194304,write 126976~45056] snapc 0=[] ondisk+write e71688) v4 ==== 217+0+45056 (2626314663 0 3883338397) 0x5569ea886b00 con 0x5569ea99c880
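(A sketch of how that failed_push object can be chased further from the CLI; list_missing and list-inconsistent-obj are the jewel-era command names, and list-inconsistent-obj needs a recent deep-scrub of the PG:)

ceph pg 6.14 query > pg-6.14-query.json                 # full PG state, as attached earlier
ceph pg 6.14 list_missing                               # objects the PG considers missing/unfound
rados list-inconsistent-obj 6.14 --format=json-pretty   # per-shard scrub errors behind the inconsistent flag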
From this extract of the pg query:
"up": [
11,
10,
84,
83,
22,
26,
69,
72,
53,
59,
8,
4,
46
],
"acting": [
2147483647,
2147483647,
84,
83,
22,
26,
69,
72,
53,
59,
8,
4,
46
]
I am wondering if there is an issue on 11, 10 that is causing the current acting primary ("acting_primary": 84) to crash, but I can't see anything that could be causing it.
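(If it helps to rule out a bad copy of that object, the shard named in the failed_push line, shard 8 on osd.53, could be inspected offline with ceph-objectstore-tool; a sketch assuming default filestore paths, adjust the service commands for your init system:)

ceph osd set noout                        # avoid rebalancing while the OSD is down
systemctl stop ceph-osd@53
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-53 \
    --journal-path /var/lib/ceph/osd/ceph-53/journal \
    --op list --pgid 6.14s8 | grep e3d8852ae8944a.0000000000047d28
systemctl start ceph-osd@53
ceph osd unset noout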
,Ashley
From: Ashley Merrick
Sent: 01 June 2017 23:39
To: ceph-users@xxxxxxxx
Subject: RE: PG Stuck EC Pool
I have attached the full pg query for the affected PG in case it shows anything of interest.
Thanks
I have a PG which is stuck in this state (it is in an EC pool with K=10, M=3):
pg 6.14 is active+undersized+degraded+remapped+inconsistent+backfilling, acting [2147483647,2147483647,84,83,22,26,69,72,53,59,8,4,46]
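(Side note: 2147483647 is 2^31 - 1, the value shown when a shard has no OSD mapped, so two of the thirteen shards have no acting OSD here. A quick sketch for double-checking the pool's EC profile, with <pool> and <profile> as placeholders:)

ceph osd pool get <pool> erasure_code_profile   # which profile the pool behind pg 6.14 uses
ceph osd erasure-code-profile get <profile>     # should show k=10 m=3 and the failure domain
ceph osd pool get <pool> min_size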
I currently have norecover set; if I unset norecover, both OSD 83 and 84 start to flap and go up and down, and I see the following in the logs of the OSDs.
*****
-5> 2017-06-01 10:08:29.658593 7f430ec97700 1 -- 172.16.3.14:6806/5204 <== osd.17 172.16.3.3:6806/2006016 57 ==== MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, committed=0, applied=1)) v1 ==== 67+0+0 (245959818 0 0) 0x563c9db7be00 con 0x563c9cfca480
-4> 2017-06-01 10:08:29.658620 7f430ec97700 5 -- op tracker -- seq: 2367, time: 2017-06-01 10:08:29.658620, event: queued_for_pg, op: MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, committed=0, applied=1))
-3> 2017-06-01 10:08:29.658649 7f4319e11700 5 -- op tracker -- seq: 2367, time: 2017-06-01 10:08:29.658649, event: reached_pg, op: MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, committed=0, applied=1))
-2> 2017-06-01 10:08:29.658661 7f4319e11700 5 -- op tracker -- seq: 2367, time: 2017-06-01 10:08:29.658660, event: done, op: MOSDECSubOpWriteReply(6.31as0 71513 ECSubWriteReply(tid=152, last_complete=0'0, committed=0, applied=1))
-1> 2017-06-01 10:08:29.663107 7f43320ec700 5 -- op tracker -- seq: 2317, time: 2017-06-01 10:08:29.663107, event: sub_op_applied, op: osd_op(osd.79.66617:8675008 6.82058b1a rbd_data.e5208a238e1f29.0000000000025f3e [copy-from ver 4678410] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e71513)
0> 2017-06-01 10:08:29.663474 7f4319610700 -1 *** Caught signal (Aborted) ** in thread 7f4319610700 thread_name:tp_osd_recov
ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
1: (()+0x9564a7) [0x563c6a6f24a7]
2: (()+0xf890) [0x7f4342308890]
3: (gsignal()+0x37) [0x7f434034f067]
4: (abort()+0x148) [0x7f4340350448]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x563c6a7f83d6]
6: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0x62f) [0x563c6a2850ff]
7: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xa8a) [0x563c6a2b878a]
8: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x36d) [0x563c6a131bbd]
9: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x1d) [0x563c6a17c88d]
10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa9f) [0x563c6a7e8e3f]
11: (ThreadPool::WorkThread::entry()+0x10) [0x563c6a7e9d70]
12: (()+0x8064) [0x7f4342301064]
13: (clone()+0x6d) [0x7f434040262d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
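(Per that NOTE, the disassembly has to come from the exact ceph-osd binary with its debug symbols installed; for example, assuming the default install path:)

objdump -rdS /usr/bin/ceph-osd > ceph-osd-10.2.7.dis   # debug symbol package name varies by distro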
What should my next steps be?
Thanks!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com