On 4-11-2016 21:09, Willem Jan Withagen wrote:
> On 4-11-2016 20:39, David Zafman wrote:
>>
>> I'm working on a similar problem with the test-erasure-eio.sh test,
>> which only fails on Jenkins. I have a pg that is active+degraded and
>> then active+recovery_wait+degraded. In this case the hinfo is missing
>> after osds are brought up and down in order to use the
>> ceph-objectstore-tool to implement the test cases. On Jenkins the osd
>> restarting causes different pg mappings than on my build machine, and
>> recovery can't make progress.
>>
>> See if you see these messages in the osd log of the primary of the pg
>> (in your example below that would be osd.3):
>>
>> handle_sub_read_reply shard=1(1) error=-5
>>
>> _failed_push: canceling recovery op for obj ...
>
> I have something like 250,000(!) lines looking like:
>
> 2016-11-04 18:09:52.769091 ba9c480 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] _failed_push: canceling recovery op for
> obj 2:eb822e21:::SOMETHING:head
>
> And another fragment is:
>
> 2016-11-04 17:55:00.233049 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_message:
> MOSDECSubOpReadReply(2.0s0 117 ECSubReadReply(tid=14, attrs_read=0)) v1
> 2016-11-04 17:55:00.233099 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_sub_read_reply: reply
> ECSubReadReply(tid=14, attrs_read=0)
> 2016-11-04 17:55:00.233150 ba6ad00 20 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_sub_read_reply shard=9(2) error=-2
> 2016-11-04 17:55:00.233200 ba6ad00 10 osd.3 pg_epoch: 117 pg[2.0s0( v
> 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84)
> [3,1,9,0,6,2] r=0 lpr=115 pi=103-114/5 rops=1 crt=72'1 lcod 0'0 mlcod
> 0'0 active+recovering+degraded] handle_sub_read_reply readop not
> complete: ReadOp(tid=14,
> to_read={2:eb822e21:::SOMETHING:head=read_request_t(to_read=[0,8388608,0],
> need=0(3),3(0),6(4),9(2), want_attrs=1)},
> complete={2:eb822e21:::SOMETHING:head=read_result_t(r=0,
> errors={9(2)=-2}, noattrs, returned=(0, 8388608, [3(0),1024]))},
> priority=3,
> obj_to_source={2:eb822e21:::SOMETHING:head=0(3),3(0),6(4),9(2)},
> source_to_obj={0(3)=2:eb822e21:::SOMETHING:head,3(0)=2:eb822e21:::SOMETHING:head,6(4)=2:eb822e21:::SOMETHING:head,9(2)=2:eb822e21:::SOMETHING:head},
> in_progress=0(3),6(4))
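If I read that ReadOp dump right (toy paraphrase in C below, not the
actual ECBackend code; shard_result is my own name), the reply carries a
per-shard errno map, and any needed shard coming back with a negative
result fails the whole read, which is what triggers the _failed_push:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* Toy model of the per-shard results in the ReadOp dump above:
     * osd(shard) -> errno. */
    struct shard_result {
        int osd;   /* e.g. 9 */
        int shard; /* e.g. 2 */
        int err;   /* 0 on success, negative errno on failure */
    };

    int main(void)
    {
        /* The reply set from the log: shard 2 on osd.9 returned -2. */
        struct shard_result replies[] = {
            { 0, 3, 0 }, { 3, 0, 0 }, { 6, 4, 0 }, { 9, 2, -ENOENT },
        };
        for (unsigned i = 0; i < sizeof(replies) / sizeof(replies[0]); i++) {
            if (replies[i].err < 0)
                /* One failed shard fails the read; recovery op is canceled. */
                printf("shard=%d(%d) error=%d (%s) -> cancel recovery op\n",
                       replies[i].osd, replies[i].shard, replies[i].err,
                       strerror(-replies[i].err));
        }
        return 0;
    }

If that is right, recovery keeps retrying the same object, fails the read
again, and the log fills up -- which would explain the 250,000 lines.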
And the log of osd.9 is also full of the following sequence:

2016-11-04 18:09:51.995641 bab2000 20 osd.9 117 share_map_peer 0xbb0c000 already has epoch 117
2016-11-04 18:09:51.995848 bab2000 10 osd.9 117 dequeue_op 0xbf59180 finish
2016-11-04 18:09:52.005020 b8b3200 10 osd.9 117 handle_replica_op MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2 epoch 117
2016-11-04 18:09:52.005149 b8b3200 20 osd.9 117 should_share_map osd.3 127.0.0.1:6813/67592 117
2016-11-04 18:09:52.005229 b8b3200 15 osd.9 117 enqueue_op 0xbf5ae40 prio 3 cost 0 latency 0.000378 MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2
2016-11-04 18:09:52.005408 bab2000 10 osd.9 117 dequeue_op 0xbf5ae40 prio 3 cost 0 latency 0.000558 MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2 pg pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active]
2016-11-04 18:09:52.005539 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active] handle_message: MOSDECSubOpRead(2.0s2 117 ECSubRead(tid=86489, to_read={2:eb822e21:::SOMETHING:head=0,2097152,0}, attrs_to_read=)) v2
2016-11-04 18:09:52.005595 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active] get_hash_info: Getting attr on 2:eb822e21:::SOMETHING:head
2016-11-04 18:09:52.005641 bab2000 10 osd.9 pg_epoch: 117 pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active] get_hash_info: not in cache 2:eb822e21:::SOMETHING:head
2016-11-04 18:09:52.005853 bab2000 -1 log_channel(cluster) log [ERR] : handle_sub_read: Error -2 reading 2:eb822e21:::SOMETHING:head
2016-11-04 18:09:52.005880 bab2000  5 osd.9 pg_epoch: 117 pg[2.0s2( v 72'1 (0'0,72'1] local-les=116 n=1 ec=69 les/c/f 116/105/0 113/115/84) [3,1,9,0,6,2] r=2 lpr=115 pi=69-114/22 luod=0'0 crt=0'0 lcod 0'0 active] handle_sub_read: Error -2 reading 2:eb822e21:::SOMETHING:head
2016-11-04 18:09:52.005940 bab2000 20 osd.9 117 share_map_peer 0xbb0c000 already has epoch 117

And error -2 would read more or less:

#define ENOENT 2  /* No such file or directory */

And these are the SOMETHING__head files in the test:

1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./0/current/2.0s3_head/SOMETHING__head_847441D7__2_ffffffffffffffff_3
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./2/current/2.0s5_head/SOMETHING__head_847441D7__2_ffffffffffffffff_5
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:54 ./3/current/2.0s0_head/SOMETHING__head_847441D7__2_ffffffffffffffff_0
1 -rw-r--r--  1 jenkins  jenkins  1024 Nov  4 17:53 ./6/current/2.0s4_head/SOMETHING__head_847441D7__2_ffffffffffffffff_4

Perhaps this 847441D7 should read EB822E21? And is this the reason for
the read error?

--WjW
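PS: On the 847441D7 vs. EB822E21 question: if I understand the hobject
output format right, the pg log prints the object hash bit-reversed (that
is how hobject_t orders objects), so the two values may well be the same
hash. A quick check in C (reverse_bits is my own helper, not Ceph code):

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the bit order of a 32-bit value, one bit at a time. */
    static uint32_t reverse_bits(uint32_t v)
    {
        uint32_t r = 0;
        for (int i = 0; i < 32; i++) {
            r = (r << 1) | (v & 1);
            v >>= 1;
        }
        return r;
    }

    int main(void)
    {
        /* Hash from the on-disk filename SOMETHING__head_847441D7__2_... */
        printf("%08X\n", (unsigned) reverse_bits(0x847441D7)); /* EB822E21 */
        return 0;
    }

If that holds, the filenames are not mangled, and the -2 would rather be
because the shard 2 file is simply missing on osd.9 (note that only
shards 0, 3, 4 and 5 show up in the listing above).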