Hello,

We have been hitting an error when running IO over our NVMe-oF setup using the mlx5 driver, and we are wondering if anyone has seen anything similar or has any suggestions.

Both initiator and target are AMD EPYC 7502 machines connected over RDMA using a Mellanox MT28908. The target has 12 NVMe SSDs, which are exposed as a single NVMe fabrics device, one physical SSD per namespace.

When running an fio job directly targeting the fabrics devices (no filesystem, see script at the end), within a minute or so we start seeing errors like this:

[ 408.368677] mlx5_core 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002f address=0x24d08000 flags=0x0000]
[ 408.372201] infiniband mlx5_0: mlx5_handle_error_cqe:332:(pid 0): WC error: 4, Message: local protection error
[ 408.380181] infiniband mlx5_0: dump_cqe:272:(pid 0): dump error cqe
[ 408.380187] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 408.380189] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 408.380191] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 408.380192] 00000030: 00 00 00 00 a9 00 56 04 00 00 01 e9 00 54 e8 e2
[ 408.380230] nvme nvme15: RECV for CQE 0x00000000ce392ed9 failed with status local protection error (4)
[ 408.380235] nvme nvme15: starting error recovery
[ 408.380238] nvme_ns_head_submit_bio: 726 callbacks suppressed
[ 408.380246] block nvme15n2: no usable path - requeuing I/O
[ 408.380284] block nvme15n5: no usable path - requeuing I/O
[ 408.380298] block nvme15n1: no usable path - requeuing I/O
[ 408.380304] block nvme15n11: no usable path - requeuing I/O
[ 408.380304] block nvme15n11: no usable path - requeuing I/O
[ 408.380330] block nvme15n1: no usable path - requeuing I/O
[ 408.380350] block nvme15n2: no usable path - requeuing I/O
[ 408.380371] block nvme15n6: no usable path - requeuing I/O
[ 408.380377] block nvme15n6: no usable path - requeuing I/O
[ 408.380382] block nvme15n4: no usable path - requeuing I/O
[ 408.380472] mlx5_core 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002f address=0x24d09000 flags=0x0000]
[ 408.391265] mlx5_core 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002f address=0x24d0a000 flags=0x0000]
[ 415.125967] nvmet: ctrl 1 keep-alive timer (5 seconds) expired!
[ 415.131898] nvmet: ctrl 1 fatal error occurred!

Occasionally, we've seen the following stack trace:

[ 1158.152464] kernel BUG at drivers/iommu/amd/io_pgtable.c:485!
[ 1158.427696] invalid opcode: 0000 [#1] SMP NOPTI
[ 1158.432228] CPU: 51 PID: 796 Comm: kworker/51:1H Tainted: P OE 5.13.0-eid-athena-g6fb4e704d11c-dirty #14
[ 1158.443867] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020
[ 1158.451252] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[ 1158.456884] RIP: 0010:iommu_v1_unmap_page+0xed/0x100
[ 1158.461849] Code: 48 8b 45 d0 65 48 33 04 25 28 00 00 00 75 1d 48 83 c4 10 4c 89 f0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 8d 46 ff 4c 85 f0 74 d6 <0f> 0b e8 1c 38 46 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44
[ 1158.480589] RSP: 0018:ffffabb520587bd0 EFLAGS: 00010206
[ 1158.485812] RAX: 0001000000061fff RBX: 0000000000100000 RCX: 0000000000000027
[ 1158.492938] RDX: 0000000030562000 RSI: ffff000000000000 RDI: 0000000000000000
[ 1158.500071] RBP: ffffabb520587c08 R08: ffffabb520587bd0 R09: 0000000000000000
[ 1158.507202] R10: 0000000000000001 R11: 000ffffffffff000 R12: ffff9984abd9e318
[ 1158.514326] R13: ffff9984abd9e310 R14: 0001000000062000 R15: 0001000000000000
[ 1158.521452] FS: 0000000000000000(0000) GS:ffff99a36c8c0000(0000) knlGS:0000000000000000
[ 1158.529540] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1158.535286] CR2: 00007f75b04f1000 CR3: 00000001eddd8000 CR4: 0000000000350ee0
[ 1158.542419] Call Trace:
[ 1158.544877] amd_iommu_unmap+0x2c/0x40
[ 1158.548653] __iommu_unmap+0xc4/0x170
[ 1158.552344] iommu_unmap_fast+0xe/0x10
[ 1158.556100] __iommu_dma_unmap+0x85/0x120
[ 1158.560115] iommu_dma_unmap_sg+0x95/0x110
[ 1158.564213] dma_unmap_sg_attrs+0x42/0x50
[ 1158.568225] rdma_rw_ctx_destroy+0x6e/0xc0 [ib_core]
[ 1158.573201] nvmet_rdma_rw_ctx_destroy+0xa7/0xc0 [nvmet_rdma]
[ 1158.578944] nvmet_rdma_read_data_done+0x5c/0xf0 [nvmet_rdma]
[ 1158.584683] __ib_process_cq+0x8e/0x150 [ib_core]
[ 1158.589398] ib_cq_poll_work+0x2b/0x80 [ib_core]
[ 1158.594027] process_one_work+0x220/0x3c0
[ 1158.598038] worker_thread+0x4d/0x3f0
[ 1158.601696] kthread+0x114/0x150
[ 1158.604928] ? process_one_work+0x3c0/0x3c0
[ 1158.609114] ? kthread_park+0x90/0x90
[ 1158.612783] ret_from_fork+0x22/0x30

We first saw this on a 5.13 kernel but could reproduce it with 5.17-rc2.

We found a possibly related bug report [1] that suggested disabling the IOMMU could help, but even after I disabled it (amd_iommu=off iommu=off) I still get errors (nvme IO timeouts). Another thread from 2016 [2] suggested that disabling some kernel debug options could work around the "local protection error", but that didn't help either.

As far as I can tell, the disks are fine, as running the same fio job against the real physical devices works fine.

Any suggestions are appreciated.

Thanks,
Martin

[1]: https://bugzilla.kernel.org/show_bug.cgi?id=210177
[2]: https://lore.kernel.org/all/6BBFD126-877C-4638-BB91-ABF715E29326@xxxxxxxxxx/

fio script:

[global]
name=fio-seq-write
rw=write
bs=1M
direct=1
numjobs=32
time_based
group_reporting=1
runtime=18000
end_fsync=1
size=10G
ioengine=libaio
iodepth=16

[file1]
filename=/dev/nvme0n1

[file2]
filename=/dev/nvme0n2

[file3]
filename=/dev/nvme0n3

[file4]
filename=/dev/nvme0n4

[file5]
filename=/dev/nvme0n5

[file6]
filename=/dev/nvme0n6

[file7]
filename=/dev/nvme0n7

[file8]
filename=/dev/nvme0n8

[file9]
filename=/dev/nvme0n9

[file10]
filename=/dev/nvme0n10

[file11]
filename=/dev/nvme0n11

[file12]
filename=/dev/nvme0n12
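
In case it helps anyone correlate the faults, here is a minimal sketch of a helper that pulls the IO_PAGE_FAULT events out of dmesg text. The regex is written against the three AMD-Vi lines quoted in the log above and may need adjusting for other kernels. The names FAULT_RE, parse_faults and SAMPLE are made up for this illustration and are not part of any kernel tool.

```python
import re

# Pattern modelled on the AMD-Vi lines quoted in this report; this is an
# assumption about the log format, not an official specification of it.
FAULT_RE = re.compile(
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): AMD-Vi: Event logged "
    r"\[IO_PAGE_FAULT domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<addr>0x[0-9a-f]+) flags=(?P<flags>0x[0-9a-f]+)\]",
    re.IGNORECASE,
)


def parse_faults(lines):
    """Return (pci_device, domain, address) tuples for each IO_PAGE_FAULT line."""
    faults = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            faults.append(
                (
                    match.group("dev"),
                    int(match.group("domain"), 16),
                    int(match.group("addr"), 16),
                )
            )
    return faults


# The three fault lines copied from the log in this report.
SAMPLE = [
    "[ 408.368677] mlx5_core 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002f address=0x24d08000 flags=0x0000]",
    "[ 408.380472] mlx5_core 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002f address=0x24d09000 flags=0x0000]",
    "[ 408.391265] mlx5_core 0000:c1:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x002f address=0x24d0a000 flags=0x0000]",
]

for dev, domain, addr in parse_faults(SAMPLE):
    print(f"{dev} domain={domain:#x} iova={addr:#x}")
```

On the sample lines, the three faulting addresses are consecutive 4 KiB pages (0x24d08000, 0x24d09000, 0x24d0a000), all in domain 0x2f on the same mlx5 function.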