Re: NVMe over Fabrics I/O Error with RDMA/RXE

Hi Sagi,

Thanks.

> On 9 Jul 2018, at 18:42, Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> 
> 
>> Hi all,
> 
> Hi Bairen,
> 
>> I am testing NVMe over Fabrics on a Linux 4.15 VM (latest Ubuntu 18.04 LTS, unmodified kernel), with an NVMe card and SoftRoCE (RXE) as the transport.
>> Connecting to the NVMe-oF target works just fine; however, reads and writes to the target over RXE fail with I/O errors and timeouts.
> 
> Thanks for reporting,
> 
> Would it be possible to test with an upstream kernel? I'm testing with
> 4.17.0 and I/O seems to work fine...
> 
> I suspect some stable fixes may need to be backported to 4.15.y?
> 
> What is your exact kernel version?

The exact kernel version is 4.15.0-23.25, as packaged in:

https://packages.ubuntu.com/bionic/linux-image-4.15.0-23-generic

I tried again with the latest Debian unstable (kernel version 4.16.5), and the symptoms remain the same.

dmesg (with loopback RXE and a null_blk-backed NVMe namespace on the target):

[ 6842.291260] rdma_rxe: loaded
[ 6842.304579] rdma_rxe: set rxe0 active
[ 6842.305832] rdma_rxe: added rxe0 to eth1
[ 6842.339164] Loading iSCSI transport class v2.0-870.
[ 6842.361792] iscsi: registered transport (iser)
[ 6842.383476] RPC: Registered named UNIX socket transport module.
[ 6842.385534] RPC: Registered udp transport module.
[ 6842.387039] RPC: Registered tcp transport module.
[ 6842.388838] RPC: Registered tcp NFSv4.1 backchannel transport module.
[ 6842.410984] RPC: Registered rdma transport module.
[ 6842.412464] RPC: Registered rdma backchannel transport module.
[185834.283674] rdma_rxe: already configured on eth1
[185851.343314] rdma_rxe: unloaded
[185858.563434] rdma_rxe: loaded
[185858.574959] rdma_rxe: set rxe0 active
[185858.576092] rdma_rxe: added rxe0 to eth1
[185865.446437] nvmet: adding nsid 10 to subsystem nvmeof
[185877.282562] null: module loaded
[185964.432807] nvmet_rdma: enabling port 1 (10.130.27.157:4420)
[185997.428294] nvmet: creating controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN nqn.2014-08.org.nvmexpress:uuid:eff231b2-d35d-4d55-9c1c-af9bc0bc7513.
[185997.432770] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 10.130.27.157:4420
[186019.995552] nvmet: creating controller 1 for subsystem nvmeof for NQN nqn.2014-08.org.nvmexpress:uuid:eff231b2-d35d-4d55-9c1c-af9bc0bc7513.
[186019.999684] nvme nvme0: creating 4 I/O queues.
[186020.004922] nvme nvme0: new ctrl: NQN "nvmeof", addr 10.130.27.157:4420
[186135.448783] nvme nvme0: I/O 1 QID 3 timeout, reset controller
[186135.450532] print_req_error: I/O error, dev nvme0c1n1, sector 158839936
[186135.452608] nvme nvme0: I/O 2 QID 3 timeout, reset controller
[186135.454420] print_req_error: I/O error, dev nvme0c1n1, sector 151934592
[186135.456783] nvme nvme0: I/O 3 QID 3 timeout, reset controller
[186135.458578] print_req_error: I/O error, dev nvme0c1n1, sector 427118592
[186135.460564] nvme nvme0: I/O 4 QID 3 timeout, reset controller
[186135.462139] print_req_error: I/O error, dev nvme0c1n1, sector 208058112
[186135.463996] nvme nvme0: I/O 5 QID 3 timeout, reset controller
[186135.465744] print_req_error: I/O error, dev nvme0c1n1, sector 202496256
[186135.467683] nvme nvme0: I/O 6 QID 3 timeout, reset controller
[186135.469431] print_req_error: I/O error, dev nvme0c1n1, sector 220602880
[186135.471460] nvme nvme0: I/O 7 QID 3 timeout, reset controller
[186135.473098] print_req_error: I/O error, dev nvme0c1n1, sector 184964352
[186135.474866] nvme nvme0: I/O 8 QID 3 timeout, reset controller
[186135.476329] print_req_error: I/O error, dev nvme0c1n1, sector 181691264
[186135.478304] nvme nvme0: I/O 9 QID 3 timeout, reset controller
[186135.480063] print_req_error: I/O error, dev nvme0c1n1, sector 36848384
[186135.482098] nvme nvme0: I/O 10 QID 3 timeout, reset controller
[186135.485114] print_req_error: I/O error, dev nvme0c1n1, sector 501229312
[186135.487968] nvme nvme0: I/O 11 QID 3 timeout, reset controller
[186135.490272] nvme nvme0: I/O 12 QID 3 timeout, reset controller
[186135.492702] nvme nvme0: I/O 13 QID 3 timeout, reset controller
[186135.495424] nvme nvme0: I/O 14 QID 3 timeout, reset controller
[186135.497818] nvme nvme0: I/O 15 QID 3 timeout, reset controller
[186135.501531] nvme nvme0: I/O 16 QID 3 timeout, reset controller
[186135.504271] nvme nvme0: I/O 17 QID 3 timeout, reset controller
[186135.507185] nvme nvme0: I/O 18 QID 3 timeout, reset controller
[186135.509768] nvme nvme0: I/O 19 QID 3 timeout, reset controller
[186135.512342] nvme nvme0: I/O 20 QID 3 timeout, reset controller
[186135.515511] nvme nvme0: I/O 21 QID 3 timeout, reset controller
[186135.518412] nvme nvme0: I/O 22 QID 3 timeout, reset controller
[186135.521632] nvme nvme0: I/O 23 QID 3 timeout, reset controller
[186135.524941] nvme nvme0: I/O 24 QID 3 timeout, reset controller
[186135.527899] nvme nvme0: I/O 25 QID 3 timeout, reset controller
[186135.530993] nvme nvme0: I/O 26 QID 3 timeout, reset controller
[186135.533780] nvme nvme0: I/O 121 QID 3 timeout, reset controller
[186135.536581] nvme nvme0: I/O 122 QID 3 timeout, reset controller
[186135.539624] nvme nvme0: I/O 123 QID 3 timeout, reset controller
[186135.542805] nvme nvme0: I/O 124 QID 3 timeout, reset controller
[186135.545982] nvme nvme0: I/O 125 QID 3 timeout, reset controller
[186135.548889] nvme nvme0: I/O 126 QID 3 timeout, reset controller
[186135.551095] nvme nvme0: I/O 1 QID 4 timeout, reset controller
[186135.553682] nvme nvme0: I/O 2 QID 4 timeout, reset controller
[186135.556214] nvme nvme0: I/O 3 QID 4 timeout, reset controller
[186135.558771] nvme nvme0: I/O 4 QID 4 timeout, reset controller
[186135.561007] nvme nvme0: I/O 5 QID 4 timeout, reset controller
[186135.563897] nvme nvme0: I/O 6 QID 4 timeout, reset controller
[186135.566642] nvme nvme0: I/O 7 QID 4 timeout, reset controller
[186135.569510] nvme nvme0: I/O 8 QID 4 timeout, reset controller
[186135.572048] nvme nvme0: I/O 9 QID 4 timeout, reset controller
[186135.574628] nvme nvme0: I/O 10 QID 4 timeout, reset controller
[186135.576829] nvme nvme0: I/O 11 QID 4 timeout, reset controller
[186135.579154] nvme nvme0: I/O 12 QID 4 timeout, reset controller
[186135.581375] nvme nvme0: I/O 13 QID 4 timeout, reset controller
[186135.583684] nvme nvme0: I/O 14 QID 4 timeout, reset controller
[186135.585642] nvme nvme0: I/O 15 QID 4 timeout, reset controller
[186135.587936] nvme nvme0: I/O 16 QID 4 timeout, reset controller
[186135.590076] nvme nvme0: I/O 17 QID 4 timeout, reset controller
[186135.592338] nvme nvme0: I/O 18 QID 4 timeout, reset controller
[186135.594644] nvme nvme0: I/O 19 QID 4 timeout, reset controller
[186135.597330] nvme nvme0: I/O 20 QID 4 timeout, reset controller
[186135.599994] nvme nvme0: I/O 21 QID 4 timeout, reset controller
[186135.602204] nvme nvme0: I/O 22 QID 4 timeout, reset controller
[186135.604896] nvme nvme0: I/O 23 QID 4 timeout, reset controller
[186135.607214] nvme nvme0: I/O 24 QID 4 timeout, reset controller
[186135.609274] nvme nvme0: I/O 25 QID 4 timeout, reset controller
[186135.611314] nvme nvme0: I/O 26 QID 4 timeout, reset controller
[186135.613598] nvme nvme0: I/O 27 QID 4 timeout, reset controller
[186135.615549] nvme nvme0: I/O 28 QID 4 timeout, reset controller
[186135.617562] nvme nvme0: I/O 29 QID 4 timeout, reset controller
[186135.621237] nvme nvme0: I/O 30 QID 4 timeout, reset controller
[186135.623435] nvme nvme0: I/O 31 QID 4 timeout, reset controller
[186135.625352] nvme nvme0: I/O 32 QID 4 timeout, reset controller
[186135.653786] nvme nvme0: Reconnecting in 10 seconds...
[186145.897284] nvmet: creating controller 1 for subsystem nvmeof for NQN nqn.2014-08.org.nvmexpress:uuid:eff231b2-d35d-4d55-9c1c-af9bc0bc7513.
[186145.905109] nvme nvme0: creating 4 I/O queues.
[186145.911808] nvme nvme0: Successfully reconnected (1 attempts)
[186145.944882] nvme0c1n1: detected capacity change from 0 to 268435456000

I will try the mainline kernel (4.17) in the next few days.

> 
> The difference from linux-stable-4.15.y is:
> --
> bb42f87e2924 IB/rxe: avoid unnecessary export
> da2f3b286a39 IB/rxe: avoid calling WARN_ON_ONCE twice
> e02637e97d68 IB: Fix RDMA_RXE and INFINIBAND_RDMAVT dependencies for DMA_VIRT_OPS
> f5e27a203f4c Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
> 1661d3b0e218 nvmet,rxe: defer ip datagram sending to tasklet
> ecb238f6a7f3 IB/cxgb4: use skb_put_zero()/__skb_put_zero
> 9fd4350ba895 IB/rxe: avoid double kfree_skb
> 2da36d44a9d5 IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV
> e12ee8ce5143 IB/rxe: remove unused function variable
> 0dff463a6a86 IB/rxe: change rxe_set_mtu function type to void
> 10c47d560603 IB/rxe: Change rxe_rcv to return void
> fe896ceb5772 IB/rxe: replace refcount_inc with skb_get
> 2e47350789eb IB/rxe: optimize the function duplicate_request
> 8f1a72c815cf IB/rxe: make rxe_release_udp_tunnel static
> 76be04500be2 IB/rxe: avoid export symbols
> 0f02ba7ed16a IB/rxe: make the variable static
> efc365e7290d IB/rxe: Fix for oops in rxe_register_device on ppc64le arch
> 39e00b6cf658 IB/rxe: Removed GID add/del dummy routines
> 414448d249d8 RDMA: Use ib_gid_attr during GID modification
> 3e44e0ee0893 IB/providers: Avoid null netdev check for RoCE
> 0e1f9b924471 RDMA/providers: Simplify query_gid callback of RoCE providers
> 99dae690255e IB/rxe: optimize mcast recv process
> f2e9bfac13c9 RDMA/rxe: Fix uABI structure layouts for 32/64 compat
> 0ede73bc012c IB/uverbs: Extend uverbs_ioctl header with driver_id
> 0c43ab371bcb RDMA/rxe: Use structs to describe the uABI instead of opencoding
> b92ec0fe3224 RDMA/rxe: Get rid of confusing udata parameter to rxe_cq_chk_attr
> 43c9fc509fa5 rdma_rxe: make rxe work over 802.1q VLAN devices
> befd8d98f230 IB/rxe: change the function rxe_init_device_param type
> 31f1bd14cbfe IB/rxe: remove unnecessary rxe in rxe_send
> 86af61764151 IB/rxe: remove unnecessary skb_clone
> a6544a624c3f RDMA/rxe: Fix an out-of-bounds read
> c33bab622d47 IB/rxe: Remove unused variable (char *rxe_qp_state_name[])
> 7061f28d8a2f rxe: Do not use 'struct sockaddr' in a uapi header
> 316663c6bfeb IB/rxe: remove redudant parameter in rxe_av_fill_ip_info
> 45a290f8f6c6 IB/rxe: change the function rxe_av_fill_ip_info to void
> a402dc445dd6 IB/rxe: change the function to void from int
> 9c96f3d4ddba IB/rxe: remove unnecessary parameter in rxe_av_to_attr
> ca3d9feee6a9 IB/rxe: change the function to void from int
> 1a241db19ca9 IB/rxe: remove redudant parameter in function
> 0812ed132178 IB/rxe: Change RDMA_RXE kconfig to use select
> --
> 
> Possible suspect might be:
> --
> commit 2da36d44a9d54a2c6e1f8da1f7ccc26b0bc6cfec
> Author: Jianchao Wang <jianchao.w.wang@xxxxxxxxxx>
> Date:   Thu Apr 26 11:52:39 2018 +0800
> 
>    IB/rxe: add RXE_START_MASK for rxe_opcode IB_OPCODE_RC_SEND_ONLY_INV
> 
>    w/o RXE_START_MASK, the last_psn of IB_OPCODE_RC_SEND_ONLY_INV
>    will not be updated in update_wqe_psn, and the corresponding
>    wqe will not be acked in rxe_completer due to its last_psn is
>    zero. Finally, the other wqe will also not be able to be acked,
>    because the wqe of IB_OPCODE_RC_SEND_ONLY_INV with last_psn 0
>    is still there. This causes large amount of io timeout when
>    nvmeof is over rxe.
> 
>    Add RXE_START_MASK for IB_OPCODE_RC_SEND_ONLY_INV to fix this.
> 
>    Signed-off-by: Jianchao Wang <jianchao.w.wang@xxxxxxxxxx>
>    Reviewed-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
>    Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
> --
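
If I read the commit right, the change itself is small: RXE_START_MASK is
added to the IB_OPCODE_RC_SEND_ONLY_INV entry of the rxe_opcode[] table in
drivers/infiniband/sw/rxe/rxe_opcode.c, so that update_wqe_psn() records a
last_psn even for single-packet SEND_ONLY_INV requests. A rough sketch
(flag list abridged; see the upstream diff for the exact entry):
--
	[IB_OPCODE_RC_SEND_ONLY_WITH_INVALIDATE] = {
		.name = "IB_OPCODE_RC_SEND_ONLY_INV",
		.mask = RXE_IETH_MASK | RXE_PAYLOAD_MASK | RXE_REQ_MASK
			| RXE_COMP_MASK | RXE_RWR_MASK | RXE_SEND_MASK
			| RXE_END_MASK | RXE_START_MASK,  /* <-- the fix */
		...
	},
--
That would match what I see here: single-packet sends never complete, and
everything queued behind them eventually times out.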
> 
> Another useful stable candidate would be:
> --
> commit 1661d3b0e2183ce90f6611641c350a5aa02aaa80
> Author: Alexandru Moise <00moses.alexander00@xxxxxxxxx>
> Date:   Tue May 8 11:02:02 2018 +0200
> 
>    nvmet,rxe: defer ip datagram sending to tasklet
> 
>    This addresses 3 separate problems:
> 
>    1. When using NVME over Fabrics we may end up sending IP
>    packets in interrupt context, we should defer this work
>    to a tasklet.
> 
>    [   50.939957] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:161 __local_bh_enable_ip+0x1f/0xa0
>    [   50.942602] CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G        W         4.17.0-rc3-ARCH+ #104
>    [   50.945466] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
>    [   50.948163] RIP: 0010:__local_bh_enable_ip+0x1f/0xa0
>    [   50.949631] RSP: 0018:ffff88009c183900 EFLAGS: 00010006
>    [   50.951029] RAX: 0000000080010403 RBX: 0000000000000200 RCX: 0000000000000001
>    [   50.952636] RDX: 0000000000000000 RSI: 0000000000000200 RDI: ffffffff817e04ec
>    [   50.954278] RBP: ffff88009c183910 R08: 0000000000000001 R09: 0000000000000614
>    [   50.956000] R10: ffffea00021d5500 R11: 0000000000000001 R12: ffffffff817e04ec
>    [   50.957779] R13: 0000000000000000 R14: ffff88009566f400 R15: ffff8800956c7000
>    [   50.959402] FS:  0000000000000000(0000) GS:ffff88009c180000(0000) knlGS:0000000000000000
>    [   50.961552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [   50.963798] CR2: 000055c4ec0ccac0 CR3: 0000000002209001 CR4: 00000000000606e0
>    [   50.966121] Call Trace:
>    [   50.966845]  <IRQ>
>    [   50.967497]  __dev_queue_xmit+0x62d/0x690
>    [   50.968722]  dev_queue_xmit+0x10/0x20
>    [   50.969894]  neigh_resolve_output+0x173/0x190
>    [   50.971244]  ip_finish_output2+0x2b8/0x370
>    [   50.972527]  ip_finish_output+0x1d2/0x220
>    [   50.973785]  ? ip_finish_output+0x1d2/0x220
>    [   50.975010]  ip_output+0xd4/0x100
>    [   50.975903]  ip_local_out+0x3b/0x50
>    [   50.976823]  rxe_send+0x74/0x120
>    [   50.977702]  rxe_requester+0xe3b/0x10b0
>    [   50.978881]  ? ip_local_deliver_finish+0xd1/0xe0
>    [   50.980260]  rxe_do_task+0x85/0x100
>    [   50.981386]  rxe_run_task+0x2f/0x40
>    [   50.982470]  rxe_post_send+0x51a/0x550
>    [   50.983591]  nvmet_rdma_queue_response+0x10a/0x170
>    [   50.985024]  __nvmet_req_complete+0x95/0xa0
>    [   50.986287]  nvmet_req_complete+0x15/0x60
>    [   50.987469]  nvmet_bio_done+0x2d/0x40
>    [   50.988564]  bio_endio+0x12c/0x140
>    [   50.989654]  blk_update_request+0x185/0x2a0
>    [   50.990947]  blk_mq_end_request+0x1e/0x80
>    [   50.991997]  nvme_complete_rq+0x1cc/0x1e0
>    [   50.993171]  nvme_pci_complete_rq+0x117/0x120
>    [   50.994355]  __blk_mq_complete_request+0x15e/0x180
>    [   50.995988]  blk_mq_complete_request+0x6f/0xa0
>    [   50.997304]  nvme_process_cq+0xe0/0x1b0
>    [   50.998494]  nvme_irq+0x28/0x50
>    [   50.999572]  __handle_irq_event_percpu+0xa2/0x1c0
>    [   51.000986]  handle_irq_event_percpu+0x32/0x80
>    [   51.002356]  handle_irq_event+0x3c/0x60
>    [   51.003463]  handle_edge_irq+0x1c9/0x200
>    [   51.004473]  handle_irq+0x23/0x30
>    [   51.005363]  do_IRQ+0x46/0xd0
>    [   51.006182]  common_interrupt+0xf/0xf
>    [   51.007129]  </IRQ>
> 
>    2. Work must always be offloaded to tasklet for rxe_post_send_kernel()
>    when using NVMEoF in order to solve lock ordering between neigh->ha_lock
>    seqlock and the nvme queue lock:
> 
>    [   77.833783]  Possible interrupt unsafe locking scenario:
>    [   77.833783]
>    [   77.835831]        CPU0                    CPU1
>    [   77.837129]        ----                    ----
>    [   77.838313]   lock(&(&n->ha_lock)->seqcount);
>    [   77.839550]                                local_irq_disable();
>    [   77.841377] lock(&(&nvmeq->q_lock)->rlock);
>    [   77.843222] lock(&(&n->ha_lock)->seqcount);
>    [   77.845178]   <Interrupt>
>    [   77.846298]     lock(&(&nvmeq->q_lock)->rlock);
>    [   77.847986]
>    [   77.847986]  *** DEADLOCK ***
> 
>    3. Same goes for the lock ordering between sch->q.lock and nvme queue lock:
> 
>    [   47.634271]  Possible interrupt unsafe locking scenario:
>    [   47.634271]
>    [   47.636452]        CPU0                    CPU1
>    [   47.637861]        ----                    ----
>    [   47.639285]   lock(&(&sch->q.lock)->rlock);
>    [   47.640654]                                local_irq_disable();
>    [   47.642451] lock(&(&nvmeq->q_lock)->rlock);
>    [   47.644521] lock(&(&sch->q.lock)->rlock);
>    [   47.646480]   <Interrupt>
>    [   47.647263]     lock(&(&nvmeq->q_lock)->rlock);
>    [   47.648492]
>    [   47.648492]  *** DEADLOCK ***
> 
>    Using NVMEoF after this patch seems to finally be stable, without it,
>    rxe eventually deadlocks the whole system and causes RCU stalls.
> 
>    Signed-off-by: Alexandru Moise <00moses.alexander00@xxxxxxxxx>
>    Reviewed-by: Zhu Yanjun <yanjun.zhu@xxxxxxxxxx>
>    Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx>
> --
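
If I understand this one correctly, the problem is that the RXE send path
could end up calling ip_local_out() from hard-irq context (the nvme_irq
frame in the trace above), so the fix defers the actual datagram send to a
tasklet running in softirq context, which also breaks the lock-ordering
cycles with the nvme queue lock. A minimal sketch of that pattern
(illustrative names, not the actual rxe symbols):
--
#include <linux/interrupt.h>
#include <linux/skbuff.h>
#include <net/dst.h>
#include <net/ip.h>

static struct sk_buff_head deferred_skbs;
static struct tasklet_struct send_tasklet;

/* Runs in softirq context, where taking the IP xmit path is safe. */
static void send_tasklet_fn(unsigned long data)
{
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&deferred_skbs)) != NULL)
		ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
}

/* Callable from any context, including hard-irq completion paths. */
static void defer_ip_send(struct sk_buff *skb)
{
	skb_queue_tail(&deferred_skbs, skb);	/* internally locked, irq-safe */
	tasklet_schedule(&send_tasklet);
}

static void defer_send_init(void)
{
	skb_queue_head_init(&deferred_skbs);
	tasklet_init(&send_tasklet, send_tasklet_fn, 0);
}
--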
