Thanks Jens,

Imported those two commits, in addition to the commit that reintroduced the io_wq_current_is_worker() helper used by one of them. Re-tested on this base and no longer see the failure. Awesome!

Cheers --- Mark

Date: Tue, 17 Dec 2019 14:13:37 -0700
Subject: [PATCH] io-wq: re-add io_wq_current_is_worker()

This reverts commit 8cdda87a4414, we now have several use cases for this helper. Reinstate it.

Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
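For reference (not quoting the patch body here), the helper being re-added is just a small inline check in fs/io-wq.h; from memory it looks roughly like this:

    static inline bool io_wq_current_is_worker(void)
    {
    	/* true only for io-wq worker threads, which run with PF_IO_WORKER set */
    	return in_task() && (current->flags & PF_IO_WORKER);
    }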
-----Original Message-----
From: Jens Axboe <axboe@xxxxxxxxx>
Sent: Tuesday, February 11, 2020 11:45 AM
To: Wunderlich, Mark <mark.wunderlich@xxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx
Cc: Sagi Grimberg <sagi@xxxxxxxxxxx>
Subject: Re: Fault seen with io_uring and nvmf/tcp

On 2/11/20 12:30 PM, Wunderlich, Mark wrote:
> Posting to this mail list in hopes someone has already seen this fault before I start digging. Using the nvme-5.5-rc branch of the git.infradead.org repo.
> Pulled this branch and running it un-modified.
> Performing FIO (io_uring) test (initiating on 8 host cores, TIME=30, RWMIX=100, BLOCK_SIZE=4k, DEPTH=32, BATCH=8), using the latest version of fio.
>
> cmd="fio --filename=/dev/nvme0n1 --time_based --runtime=$TIME
> --ramp_time=10 --thread --rw=randrw --rwmixread=$RWMIX
> --refill_buffers --direct=1 --ioengine=io_uring --hipri --fixedbufs
> --bs=$BLOCK_SIZE --iodepth=$DEPTH --iodepth_batch_complete_min=1
> --iodepth_batch_complete_max=$DEPTH --iodepth_batch=$BATCH --numjobs=1
> --group_reporting --gtod_reduce=0 --disable_lat=0 --name=cpu3
> --cpus_allowed=3 --name=cpu5 --cpus_allowed=5 --name=cpu7
> --cpus_allowed=7 --name=cpu9 --cpus_allowed=9 --name=cpu11
> --cpus_allowed=11 --name=cpu13 --cpus_allowed=13 --name=cpu15
> --cpus_allowed=15 --name=cpu17 --cpus_allowed=17
>
> NVMf TCP queue configuration is 1 default queue and 101 poll queues. Connected to a single remote NVMe ram disk device.
> I/O performs normally up to the 30 second run, but faults just at the end. Very repeatable.
>
> Thanks for your time --- Mark
>
> [64592.841944] nvme nvme0: mapped 1/0/101 default/read/poll queues.
> [64592.867003] nvme nvme0: new ctrl: NQN "testrd", addr 192.168.0.1:4420
> [64646.940588] list_add corruption. prev->next should be next (ffff9c1feb2bc7c8), but was ffff9c1ff7ee5368. (prev=ffff9c1ff7ee5468).
> [64646.941149] ------------[ cut here ]------------
> [64646.941150] kernel BUG at lib/list_debug.c:28!
> [64646.941360] invalid opcode: 0000 [#1] SMP PTI
> [64646.941561] CPU: 82 PID: 7886 Comm: io_wqe_worker-0 Tainted: G O 5.5.0-rc2stable+ #32
> [64646.941994] Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 1.4.9 06/29/2018
> [64646.942349] RIP: 0010:__list_add_valid+0x64/0x70
> [64646.942562] Code: 48 89 fe 31 c0 48 c7 c7 40 21 17 89 e8 f9 5c c6 ff 0f 0b 48 89 d1 48 c7 c7 e8 20 17 89 48 89 f2 48 89 c6 31 c0 e8 e0 5c c6 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 48 8b 07 48 b9 00 01 00 00 00
> [64646.943442] RSP: 0018:ffffa78a49137d90 EFLAGS: 00010246
> [64646.943687] RAX: 0000000000000075 RBX: ffff9c1ff7ee5a00 RCX: 0000000000000000
> [64646.944021] RDX: 0000000000000000 RSI: ffff9c0fffe59d28 RDI: ffff9c0fffe59d28
> [64646.944356] RBP: ffffa78a49137df8 R08: 00000000000006ad R09: ffffffff88ec3be0
> [64646.944691] R10: 000000000000000f R11: 0000000007070707 R12: ffff9c1feb2bc600
> [64646.945025] R13: ffff9c1feb2bc7c8 R14: ffff9c1ff7ee5468 R15: ffff9c1ff7ee5a68
> [64646.945360] FS:  0000000000000000(0000) GS:ffff9c0fffe40000(0000) knlGS:0000000000000000
> [64646.945739] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [64646.946008] CR2: 00007f4423eb7004 CR3: 000000169940a005 CR4: 00000000007606e0
> [64646.946343] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [64646.946677] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [64646.947012] PKRU: 55555554
> [64646.947138] Call Trace:
> [64646.947260]  io_issue_sqe+0x115/0xa30
> [64646.947429]  io_wq_submit_work+0xb5/0x1d0
> [64646.947615]  io_worker_handle_work+0x19d/0x4c0
> [64646.947823]  io_wqe_worker+0xdc/0x390
> [64646.947998]  kthread+0xf8/0x130
> [64646.948141]  ? io_wq_for_each_worker+0xb0/0xb0
> [64646.948349]  ? kthread_bind+0x10/0x10
> [64646.948522]  ret_from_fork+0x35/0x40

I think you want to check that you have these in your tree:

commit 11ba820bf163e224bf5dd44e545a66a44a5b1d7a
Author: Jens Axboe <axboe@xxxxxxxxx>
Date:   Wed Jan 15 21:51:17 2020 -0700

    io_uring: ensure workqueue offload grabs ring mutex for poll list

and

commit 797f3f535d59f05ad12c629338beef6cb801d19e
Author: Bijan Mottahedeh <bijan.mottahedeh@xxxxxxxxxx>
Date:   Wed Jan 15 18:37:45 2020 -0800

    io_uring: clear req->result always before issuing a read/write request

--
Jens Axboe
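For anyone else chasing the same list_add corruption out of io_issue_sqe(): per its subject, the first commit above makes the io-wq offload path grab the ring mutex before putting a request on the IOPOLL list, since a worker thread does not already hold ctx->uring_lock the way the submitting task does. A rough sketch of the shape of that change in io_issue_sqe(), reconstructed from memory and not the literal diff (see the commit itself for the authoritative version):

    	if (ctx->flags & IORING_SETUP_IOPOLL) {
    		/* punted to io-wq? then we are not already holding uring_lock */
    		const bool in_async = io_wq_current_is_worker();

    		if (req->result == -EAGAIN)
    			return -EAGAIN;

    		if (in_async)
    			mutex_lock(&ctx->uring_lock);
    		/* adds the request to the ctx poll list; must be serialized */
    		io_iopoll_req_issued(req);
    		if (in_async)
    			mutex_unlock(&ctx->uring_lock);
    	}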