Hi, all We are running test on ublk to verify its stability. While running tests/generic/002 in ublksrv[1] which kills ublksrv daemon while running fio at the same time, we find a null-deference error. [1] https://github.com/ming1/ubdsrv *********************** DMESG: *********************** [ 1396.090924] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 1396.098770] ublk_ctrl_uring_cmd: cmd done ret 0 cmd_op 2, dev id 0 qid 65535 [ 1396.105714] #PF: supervisor write access in kernel mode [ 1396.105716] #PF: error_code(0x0002) - not-present page [ 1396.105718] PGD 800000013c762067 P4D 800000013c762067 PUD 634c8d067 PMD 0 [ 1396.132122] Oops: 0002 [#1] PREEMPT SMP PTI [ 1396.136613] CPU: 33 PID: 27341 Comm: kworker/33:2 Kdump: loaded Tainted: G S OE 5.19.0 #121 [ 1396.146382] Hardware name: Inventec K900G3-10G/B900G3, BIOS A2.20 06/23/2017 [ 1396.154069] Workqueue: events io_fallback_req_func [ 1396.159169] RIP: 0010:percpu_ref_get_many+0x23/0x30 [ 1396.164360] Code: 0f 1f 80 00 00 00 00 55 48 89 f5 53 48 89 fb e8 83 93 be ff 48 8b 03 a8 03 75 0b 65 48 01 28 5b 5d e9 b1 f5 be ff 48 8b 43 08 <f0> 48 01 28 5b 5d e9 a2 f5 be ff 66 90 41 54 49 89 d4 55 48 89 f5 [ 1396.183754] RSP: 0018:ffffae20e37c7e48 EFLAGS: 00010206 [ 1396.189301] RAX: 0000000000000000 RBX: ffff9e966730d800 RCX: 0000000000000000 [ 1396.196754] RDX: ffff9e9651ca9180 RSI: 0000000000000001 RDI: ffff9e966730d800 [ 1396.204214] RBP: 0000000000000001 R08: 61626c6c61665f6f R09: 665f7165725f6b63 [ 1396.211671] R10: 666562203a636e75 R11: 7265746920657269 R12: ffff9e965da36700 [ 1396.219143] R13: ffff9e965da36788 R14: 0000000000000000 R15: ffff9ed4bf876905 [ 1396.226607] FS: 0000000000000000(0000) GS:ffff9ed4bf840000(0000) knlGS:0000000000000000 [ 1396.235026] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1396.241110] CR2: 0000000000000000 CR3: 000000068cf9c001 CR4: 00000000003706e0 [ 1396.248587] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1396.256054] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1396.263523] Call Trace: [ 1396.266313] <TASK> [ 1396.268747] io_fallback_req_func+0x61/0x126 [ 1396.273354] process_one_work+0x1df/0x3b0 [ 1396.277705] worker_thread+0x49/0x2e0 [ 1396.281712] ? rescuer_thread+0x390/0x390 [ 1396.286060] kthread+0xe5/0x110 [ 1396.289534] ? kthread_complete_and_exit+0x20/0x20 [ 1396.294644] ret_from_fork+0x1f/0x30 [ 1396.298551] </TASK> [ 1396.301055] Modules linked in: ublk_drv(OE) tcp_diag(E) udp_diag(E) inet_diag(E) ip6_tables(E) iptable_filter(E) ebtable_nat(E) ebtables(E) binfmt_misc(E) intel_rapl_msr(E) intel_rapl_common(E) iosf_mbi(E) x86_pkg_temp_thermal(E) coretemp(E) bonding(E) tls(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) iTCO_wdt(E) crypto_simd(E) iTCO_vendor_support(E) cryptd(E) mxm_wmi(E) mei_me(E) i2c_i801(E) lpc_ich(E) pcspkr(E) sg(E) i2c_smbus(E) mfd_core(E) mei(E) acpi_ipmi(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) wmi(E) acpi_power_meter(E) ip_tables(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) drm_ttm_helper(E) ttm(E) crc32c_intel(E) drm(E) nvme(E) i2c_core(E) ixgbe(E) nvme_core(E) mdio(E) dca(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) ahci(E) libahci(E) libata(E) [last unloaded: ublk_drv] [ 1396.386209] CR2: 0000000000000000 *********************** DESCRIPTION 1: *********************** Here in percpu_ref_get_many():atomic_long_add(nr, &ref->data->count), ref->data is actually freed because io_uring's ctx->refs is freed at that time. We analyze ublk_drv.c and find a reason, assume there is only one tag per blk-mq queue: CPU 0 CPU 1 CPU2 CPU3 __ublk_rq_task_work() io_uring_cmd_done(io->cmd) ****task crash**** io_ring_exit_work() /* Now io_uring ctx can be freed since all ioucmd(only one) are done */ ublk_daemon_monitor_work() blk_mq_end_request(req, BLK_STS_IOERR); ublk_queue_rq() /* fio re-issue req because of IO error */ ubq_daemon_is_dying() /* NOT SEE PF_EXITING */ io_uring_cmd_complete_in_task(io->cmd) /* the io_uring ctx is freed! */ io_fallback_req_func() /* io_uring fallback */ percpu_ref_get_many() /* null-deref! */ *********************** DESCRIPTION 2: *********************** I describe the above data flow here again: (1) After the ioucmd are io_uring_cmd_done(), a crash happens. (2) The io_uring context is freed (at least part of it, e.g. ctx->refs) because we don't reference it anymore. (3) Then, monitor_work aborts the blk-mq request and fio re-tries because of the I/O error. (4) After that, ublk_queue_rq() is called for the re-issued request but it does not see PF_EXITING!(Since ublk_queue_rq() is called in another task different from ubq_daemon or ioucmd's task) (5) io_uring_cmd_complete_in_task() is called, and the ioucmd is still the old one because the tag does not change. (6) io_fallback_req_func(), and ref->data is NULL while calling atomic_long_add(nr, &ref->data->count). The root cause is that: (1) io_uring_cmd_complete_in_task() does not immediately try to call task_work_add(), but puts the ioucmd's task_work into a llist for batching. If task_work_add() is called immediately in ublk_queue_rq(), this situation can be avoided. (2) In ublk_drv.c:ublk_queue_rq(), check on PF_EXITING is added: if (unlikely(ubq_daemon_is_dying(ubq))) { fail: mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0); return BLK_STS_IOERR; } However, this check on PF_EXITING is unsafe since ublk_queue_rq() runs in another task other than ubq_daemon(ioucmd's task). ublk_queue_rq() may not see PF_EXITING after the ubq_daemon(ioucmd's task) crashes. With this check, the above null-deref error rarely happens. But if we comment it out, We are more likely to trigger this error. *********************** HOW TO REPRODUCE: *********************** (1) Forbid check on PF_EXITING. In ublk_drv.c:ublk_queue_rq(), comment out: if (unlikely(ubq_daemon_is_dying(ubq))) { fail: mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0); return BLK_STS_IOERR; } This makes sense because this check is unsafe and we are more likely to re-produce this error without this check. (2) Then, run: make test T=generic/002 It may fail after running several times. Please point out my mistake because this bug is really complicated for me :) Regards, Zhang