Here is a proper version of the log:

634 share_map osd.39 172.31.74.226:6805/1391 634
634 should_share_map osd.39 172.31.74.226:6805/1391 634
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_message: MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply: reply ECSubReadReply(tid=177, attrs_read=0)
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply shard=39(3) error=-5
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply readop not complete: ReadOp(tid=177, to_read={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=read_request_t(to_read=[0,67108864,0], need=2(4),14(1),37(0),39(3), want_attrs=0)}, complete={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=read_result_t(r=0, errors={39(3)=-5}, noattrs, returned=(0, 67108864, [37(0),16777216]))}, priority=127, obj_to_source={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=2(4),14(1),37(0),39(3)}, source_to_obj={2(4)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,14(1)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,37(0)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,39(3)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head}, in_progress=2(4),14(1))
634 enqueue_op 0x7fbe28a3e680 prio 127 cost 0 latency 0.231999 epoch 634 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2
op_wq(0) _enqueue 17.13bs0 PGQueueable(0x7fbe28a3e680 prio 127 cost 0 e634)
op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe28a3e680 prio 127 cost 0 e634) queued
op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe28a3e680 prio 127 cost 0 e634) pg 0x7fbe2040d000
634 dequeue_op 0x7fbe28a3e680 prio 127 cost 0 latency 0.232034 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2 pg pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean]
634 share_map osd.14 172.31.89.11:6805/1033 634
634 should_share_map osd.14 172.31.89.11:6805/1033 634
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_message: MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply: reply ECSubReadReply(tid=177, attrs_read=0)
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply readop not complete: ReadOp(tid=177, to_read={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=read_request_t(to_read=[0,67108864,0], need=2(4),14(1),37(0),39(3), want_attrs=0)}, complete={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=read_result_t(r=0, errors={39(3)=-5}, noattrs, returned=(0, 67108864, [14(1),16777216, 37(0),16777216]))}, priority=127, obj_to_source={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=2(4),14(1),37(0),39(3)}, source_to_obj={2(4)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,14(1)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,37(0)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,39(3)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head}, in_progress=2(4))
634 dequeue_op 0x7fbe28a3e680 finish
op_wq(0) _process empty q, waiting
634 enqueue_op 0x7fbe47807860 prio 127 cost 0 latency 0.414489 epoch 634 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2
op_wq(0) _enqueue 17.13bs0 PGQueueable(0x7fbe47807860 prio 127 cost 0 e634)
op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe47807860 prio 127 cost 0 e634) queued
op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe47807860 prio 127 cost 0 e634) pg 0x7fbe2040d000
634 dequeue_op 0x7fbe47807860 prio 127 cost 0 latency 0.414536 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2 pg pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean]
634 share_map osd.2 172.31.20.174:6805/1908 634
634 should_share_map osd.2 172.31.20.174:6805/1908 634
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_message: MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=177, attrs_read=0)) v2
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply: reply ECSubReadReply(tid=177, attrs_read=0)
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply have shard=4
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply have shard=1
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply have shard=0
log_channel(cluster) log [ERR] : handle_sub_read_reply: Error(s) ignored for 17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head enough copies available
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply Error(s) ignored for 17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head enough copies available
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read_reply Complete: ReadOp(tid=177, to_read={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=read_request_t(to_read=[0,67108864,0], need=2(4),14(1),37(0),39(3), want_attrs=0)}, complete={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=read_result_t(r=0, errors={}, noattrs, returned=(0, 67108864, [2(4),16777216, 14(1),16777216, 37(0),16777216]))}, priority=127, obj_to_source={17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=2(4),14(1),37(0),39(3)}, source_to_obj={2(4)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,14(1)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,37(0)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head,39(3)=17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head}, in_progress=)
634 share_map_peer 0x7fbe21dd6000 already has epoch 634
634 share_map_peer 0x7fbe21dd6000 already has epoch 634
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] objects_read_async_cb: got: {17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head=0,{0~67108864(67108864)}}
pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0 mlcod 0'0 active+clean] objects_read_async_cb: cache: ExtentCache(

Regards,
Oleg

On Sat, Sep 2, 2017 at 3:46 PM, Oleg Kolosov <olekol@xxxxxxxxx> wrote:
> Hi Sage,
> I've rebuilt my code to fix the issues I had with rados bench and to
> match CRUSH rule (so shards are placed as in lrc).
> I'm running 'rados bench' and while reading I kill an osd and measure
> cluster recovery.
> The experiment runs well on the vstart.sh virtual cluster however I
> have a failure in the real cluster running with 40 osds.
> I've tested the reconstruction of my code, writing an object and
> reading it while one of the osds is down (in noout) - that has also
> worked.
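For reference, the failure-injection experiment described above amounts to roughly the following sequence. The pool name and OSD id are the ones mentioned elsewhere in this thread; the bench durations, flags, and the systemctl invocation are illustrative assumptions, not the exact commands used:

  # Sketch of the rados bench failure-injection experiment (assumed timings/flags).
  ceph osd set noout                               # keep the stopped OSD from being marked out
  rados bench -p optlrcpool 60 write --no-cleanup  # write the benchmark objects
  rados bench -p optlrcpool 60 seq &               # read them back in the background
  sleep 10
  sudo systemctl stop ceph-osd@1                   # stop one OSD mid-read (or however the daemon is managed)
  wait                                             # any EIO / latency spike shows up in the seq pass
  ceph -s                                          # follow recovery until HEALTH_OK
  ceph osd unset noout
  rados -p optlrcpool cleanup

If the erasure code cannot reconstruct the stripe from the shards that are still readable, the seq pass above is where the -5 (EIO) surfaces.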
>
> I've seen the following error in the log:
>
> osd.37 634 enqueue_op 0x7fbe2261c520 prio 127 cost 0 latency 0.303431 epoch 634 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=182, attrs_read=0)) v2
> osd.37 op_wq(0) _enqueue 17.13bs0 PGQueueable(0x7fbe2261c520 prio 127 cost 0 e634)
> osd.37 op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe2261c520 prio 127 cost 0 e634) queued
> osd.37 op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe2261c520 prio 127 cost 0 e634) pg 0x7fbe2040d000
> osd.37 634 dequeue_op 0x7fbe2261c520 prio 127 cost 0 latency 0.303503 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=182, attrs_read=0)) v2 pg pg[17.13bs0( v 628'1 (0'0,628'1] local-li
> osd.37 634 share_map osd.2 172.31.20.174:6805/1908 634
> osd.37 634 should_share_map osd.2 172.31.20.174:6805/1908 634
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 634 dequeue_op 0x7fbe2261c520 finish
> osd.37 op_wq(0) _process empty q, waiting
> osd.37 634 enqueue_op 0x7fbe203f2ec0 prio 127 cost 0 latency 0.329054 epoch 634 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=182, attrs_read=0)) v2
> osd.37 op_wq(0) _enqueue 17.13bs0 PGQueueable(0x7fbe203f2ec0 prio 127 cost 0 e634)
> osd.37 op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe203f2ec0 prio 127 cost 0 e634) queued
> osd.37 op_wq(0) _process 17.13bs0 item PGQueueable(0x7fbe203f2ec0 prio 127 cost 0 e634) pg 0x7fbe2040d000
> osd.37 634 dequeue_op 0x7fbe203f2ec0 prio 127 cost 0 latency 0.329081 MOSDECSubOpReadReply(17.13bs0 634/632 ECSubReadReply(tid=182, attrs_read=0)) v2 pg pg[17.13bs0( v 628'1 (0'0,628'1] local-li
> osd.37 634 share_map osd.14 172.31.89.11:6805/1033 634
> osd.37 634 should_share_map osd.14 172.31.89.11:6805/1033 634
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> log_channel(cluster) log [ERR] : handle_sub_read_reply: Error(s) ignored for 17:dcd09fc8:::benchmark_data_adm-1_14964_object2:head enough copies available
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1]
> local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
> osd.37 pg_epoch: 634 pg[17.13bs0( v 628'1 (0'0,628'1] local-lis/les=632/633 n=1 ec=625/625 lis/c 632/632 les/c/f 633/633/0 632/632/625) [37,14,32,39,2,23,35,30,27] r=0 lpr=632 crt=628'1 lcod 0'0
>
>
> I've brought down osd.1 and it was replaced by osd.39. But when trying
> to read the object osd.39 causes a problem for some reason (data
> shards are 0,1,3,4. osd.39 contains shard 3). I should also mention
> that the cluster recovers successfully in this experiment (reaches
> HEALTH_OK).
> I would think that something is wrong with my code, but as I've
> mentioned both virtual cluster and specific object read are working.
>
> Thanks,
> Oleg
>
> On Wed, Aug 2, 2017 at 11:54 AM, Oleg Kolosov <olekol@xxxxxxxxx> wrote:
>> Hi Sage,
>> You are correct! The problem is indeed only for my erasure code,
>> however I'm not sure what it is.
>> I've reproduced the issue in the virtual cluster created by vstart.sh
>> (logging enabled).
>> I've compared the logs of all OSDs with my erasure code versus a
>> similar implementation of lrc - however there were no visible
>> differences.
>> Any tips on what I should look for?
>> What might affect this in the erasure code source code?
>>
>> Another interesting thing I'm seeing is that even cleanup fails:
>>
>>> bin/rados -p optlrcpool cleanup
>> error during cleanup: -5
>> error 5: (5) Input/output error
>>
>> This only occurs when I set 'noout' & stop an osd and then write and
>> read with the benchmark.
>>
>>
>> I should mention that my erasure code works properly besides this
>> experiment. It manages to recover from a failed node during normal
>> mode of operation.
>>
>> Thanks,
>> Oleg
>>
>> On Tue, Aug 1, 2017 at 4:21 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> On Tue, 1 Aug 2017, Oleg Kolosov wrote:
>>>> Hi
>>>> I'm developing my own erasure code and I'm trying to run rados
>>>> benchmark on it. My experiment consists of writing objects, then while
>>>> reading them I stop an OSD and see how it affects latency.
>>>> While running this experiment with 'rados bench' I get an error after
>>>> stopping the osd:
>>>>
>>>> benchmark_data_adm-1_3806_object8 is not correct!
>>>> read got -5
>>>> error during benchmark: -5
>>>> error 5: (5) Input/output error
>>>>
>>>> Is it expected to fail like this?
>>>
>>> It's hard to say what's happening without logs, but my guess would be that
>>> your erasure code isn't correctly reconstructing the stripe when a
>>> shard is missing.
>>>
>>>> Is there a way to bypass it or somehow perform my experiment?
>>>
>>> If it's what I'm guessing, not really... nor would you want to. I would
>>> enable osd logging to see where the EIO is coming from.
>>>
>>> sage
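Following up on Sage's suggestion to enable OSD logging, here is a minimal sketch of how one might turn it up and locate the source of the EIO; it assumes the default log location and standard config options, so adjust the debug levels and OSD ids as needed:

  # Raise debug levels on the running OSDs; revert afterwards, the logs grow quickly.
  ceph tell osd.* injectargs '--debug_osd 20 --debug_ms 1'

  # Or persist it in ceph.conf under [osd] before restarting the daemons:
  #   debug osd = 20
  #   debug ms = 1

  # Then search the acting OSDs' logs for the failing object / EIO, e.g.:
  grep -nE 'error=-5|Error\(s\) ignored' /var/log/ceph/ceph-osd.37.log

With debug_osd at 20, handle_sub_read_reply lines like the ones above show which shard returned the error and whether the read could still be completed from the remaining shards.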