Hi Christoph,

I have a question about refcounting in struct iomap_dio:

As I mentioned late last year, I've been seeing a livelock since the
early 4.20-rcX era in iomap_dio_rw when running generic/323.  The
relevant part of the function starts around line 1910:

	blk_finish_plug(&plug);

	if (ret < 0)
		iomap_dio_set_error(dio, ret);

	/*
	 * If all the writes we issued were FUA, we don't need to flush the
	 * cache on IO completion. Clear the sync flag for this case.
	 */
	if (dio->flags & IOMAP_DIO_WRITE_FUA)
		dio->flags &= ~IOMAP_DIO_NEED_SYNC;

	if (!atomic_dec_and_test(&dio->ref)) {
		if (!dio->wait_for_completion)
			return -EIOCBQUEUED;

		for (;;) {
			set_current_state(TASK_UNINTERRUPTIBLE);
			if (!READ_ONCE(dio->submit.waiter))
				break;

			if (!(iocb->ki_flags & IOCB_HIPRI) ||
			    !dio->submit.last_queue ||
			    !blk_poll(dio->submit.last_queue,
					dio->submit.cookie, true))
				io_schedule(); <--------- here
		}
		__set_current_state(TASK_RUNNING);
	}

	ret = iomap_dio_complete(dio);

I finally had time to look into this, and I noticed garbage values for
dio->submit.waiter in *dio right before we call io_schedule.  I
suspected a use-after-free, so I altered iomap_dio_complete to memset
the entire structure to 0x5C just prior to kfreeing the *dio, and sure
enough I saw 0x5C in all the dio fields right before the io_schedule
call.

Next I instrumented all the places where we access dio->ref to see
what's going on, and observed the following sequence:

1. ref == 2 just before the blk_finish_plug call.
   dio->wait_for_completion == false.
2. ref == 2 just before the "if (!atomic_dec_and_test(...))".
3. ref == 1 just after the "if (!atomic_dec_and_test(...))".

Next we apparently end up in iomap_dio_bio_end_io:

4. ref == 1 just before the "if (atomic_dec_and_test(...))".
5. ref == 0 just after the "if (atomic_dec_and_test(...))".
6. iomap_dio_bio_end_io calls iomap_dio_complete, which frees the dio
   after poisoning it with 0x5C as described above.

Then we jump back to iomap_dio_rw, and:

7. ref == 0x5c5c5c5c just before the
   "if (!(iocb->ki_flags & IOCB_HIPRI)...)".

However, dio->wait_for_completion is 0x5c5c5c5c -- garbage, but
nonzero -- so instead of returning -EIOCBQUEUED we fall into the wait
loop; dio->submit.waiter is equally garbage and nonzero, so we call
io_schedule and never wake up.

KASAN confirms this diagnosis.

I noticed that only ~100us elapse between the unplug and the bio endio
function being called; I've found that I can reproduce this in a qemu
vm with a virtio-scsi device backed by a tmpfs file on the host.  My
guess is that with a slower scsi device the dispatch would take long
enough that iomap_dio_rw would return -EIOCBQUEUED well before the bio
end_io function gets called and frees the dio.

Anyhow, I'm not sure how to fix this.  It's clear that iomap_dio_rw
can't drop its ref to the dio until it's done using the dio fields.
It's tempting to change the if (!atomic_dec_and_test()) to an
if (atomic_read() > 1) and only drop the ref just before returning
-EIOCBQUEUED or just before calling iomap_dio_complete... but that
just pushes the livelock to kill_ioctx.
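A variant of that idea that might be less fragile: decide whether we're
going to wait *before* dropping the submission ref, by caching
dio->wait_for_completion in a stack local, so the async path never
dereferences the dio after the decrement.  AFAICT the sync case stays
safe because the endio side wakes the on-stack waiter instead of
freeing a dio that has wait_for_completion set.  Rough, untested sketch
against the code quoted above:

	/*
	 * Rough sketch: cache the flag while we still hold our submission
	 * reference, so the async path below never touches the dio again
	 * after the decrement.
	 */
	bool wait_for_completion = dio->wait_for_completion;

	if (!atomic_dec_and_test(&dio->ref)) {
		/*
		 * Not the last ref: for an async iocb the endio side can
		 * complete and free the dio at any moment now, so only
		 * look at the cached copy from here on.
		 */
		if (!wait_for_completion)
			return -EIOCBQUEUED;

		/*
		 * Sync iocb: the endio side wakes the on-stack waiter
		 * instead of freeing the dio, so the wait loop may keep
		 * reading dio->submit.waiter as before.
		 */
		for (;;) {
			/* ... wait loop unchanged ... */
		}
		__set_current_state(TASK_RUNNING);
	}

	ret = iomap_dio_complete(dio);

This keeps the refcounting itself unchanged; it only changes which
thread is allowed to read the struct after its own decrement.  Totally
untested, and I may well be missing something.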
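And in case it helps anyone poke at the interleaving without a VM,
here's a minimal userspace model of the race; all the names are made up
and there's nothing iomap-specific in it.  The "submitter" sees a
nonzero count after its own decrement and goes on to read a field,
while the "endio" thread drops the last reference and frees the object
in between.  Built with -fsanitize=address it reports the same
heap-use-after-free, a 1-byte read of wait_for_completion:

/*
 * Minimal userspace model of the race described above.
 * Build: cc -pthread -fsanitize=address fake_dio.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct fake_dio {
	atomic_int ref;			/* submitter ref + bio ref */
	bool wait_for_completion;
};

/* Plays the part of iomap_dio_bio_end_io: put the bio ref, free on 0. */
static void *fake_endio(void *arg)
{
	struct fake_dio *dio = arg;

	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		free(dio);
	return NULL;
}

int main(void)
{
	struct fake_dio *dio = calloc(1, sizeof(*dio));
	pthread_t endio;

	atomic_init(&dio->ref, 2);
	dio->wait_for_completion = false;	/* async iocb */

	/* The submitter's decrement: not the last ref, so the test fails. */
	if (atomic_fetch_sub(&dio->ref, 1) != 1) {
		/* Force the losing interleaving: endio runs to completion. */
		pthread_create(&endio, NULL, fake_endio, dio);
		pthread_join(&endio, NULL);

		/* The read iomap_dio_rw does next: a 1-byte UAF. */
		if (!dio->wait_for_completion)
			puts("would return -EIOCBQUEUED (read freed memory)");
	}
	return 0;
}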
Brain scrambled, kicking to the list to see if anyone has something
smarter to say. ;)

--D

==================================================================
BUG: KASAN: use-after-free in iomap_dio_rw+0x1197/0x1200
Read of size 1 at addr ffff88807780e0ec by task aio-last-ref-he/2527

CPU: 2 PID: 2527 Comm: aio-last-ref-he Not tainted 5.0.0-rc1-xfsx #rc1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x7c/0xc0
 print_address_description+0x6c/0x23c
 ? iomap_dio_rw+0x1197/0x1200
 kasan_report.cold.3+0x1c/0x3a
 ? iomap_dio_rw+0x1197/0x1200
 ? iomap_dio_rw+0x1197/0x1200
 iomap_dio_rw+0x1197/0x1200
 ? iomap_seek_data+0x130/0x130
 ? lock_contended+0xd70/0xd70
 ? lock_acquire+0x102/0x2f0
 ? xfs_ilock+0x137/0x3e0 [xfs]
 ? xfs_file_dio_aio_read+0x10f/0x360 [xfs]
 ? down_read_nested+0x78/0xc0
 ? xfs_file_dio_aio_read+0x123/0x360 [xfs]
 xfs_file_dio_aio_read+0x123/0x360 [xfs]
 ? __lock_acquire+0x645/0x44c0
 xfs_file_read_iter+0x41a/0x7a0 [xfs]
 ? aio_setup_rw+0xb4/0x170
 aio_read+0x29f/0x3d0
 ? mark_held_locks+0x130/0x130
 ? aio_prep_rw+0x960/0x960
 ? kvm_clock_read+0x14/0x30
 ? kvm_clock_read+0x14/0x30
 ? kvm_sched_clock_read+0x5/0x10
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0x18/0x1b0
 ? find_held_lock+0x33/0x1c0
 ? lock_acquire+0x102/0x2f0
 ? lock_downgrade+0x590/0x590
 io_submit_one+0xf4e/0x17e0
 ? __x64_sys_io_setup+0x2f0/0x2f0
 ? sched_clock_cpu+0x18/0x1b0
 ? find_held_lock+0x33/0x1c0
 ? lock_downgrade+0x590/0x590
 ? __x64_sys_io_submit+0x199/0x430
 __x64_sys_io_submit+0x199/0x430
 ? __ia32_compat_sys_io_submit+0x410/0x410
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 ? do_syscall_64+0x8f/0x3c0
 do_syscall_64+0x8f/0x3c0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f3589847687
Code: 00 00 00 49 83 38 00 75 ed 49 83 78 08 00 75 e6 8b 47 0c 39 47 08 75 de 31 c0 c3 0f 1f 84 00 00 00 00 00 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 b8 d2 00 00 00 0f 05 c3 0f 1f 84 00 00
RSP: 002b:00007f356d7f9818 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1
RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f3589847687
RDX: 00007f356d7f98c0 RSI: 0000000000000001 RDI: 00007f3589b50000
RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00007f356d7f98c0
R13: 0000000000000002 R14: 00007f356d7f9d40 R15: 00000000000a0000

Allocated by task 2527:
 kasan_kmalloc+0xc6/0xd0
 iomap_dio_rw+0x261/0x1200
 xfs_file_dio_aio_read+0x123/0x360 [xfs]
 xfs_file_read_iter+0x41a/0x7a0 [xfs]
 aio_read+0x29f/0x3d0
 io_submit_one+0xf4e/0x17e0
 __x64_sys_io_submit+0x199/0x430
 do_syscall_64+0x8f/0x3c0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 2519:
 __kasan_slab_free+0x133/0x180
 kfree+0xe4/0x2d0
 iomap_dio_complete+0x37b/0x6e0
 iomap_dio_complete_work+0x52/0x80
 iomap_dio_bio_end_io+0x50e/0x7a0
 blk_update_request+0x234/0x8b0
 scsi_end_request+0x7f/0x6b0
 scsi_io_completion+0x18c/0x14a0
 blk_done_softirq+0x258/0x390
 __do_softirq+0x228/0x8de

The buggy address belongs to the object at ffff88807780e0c0
 which belongs to the cache kmalloc-128 of size 128
The buggy address is located 44 bytes inside of
 128-byte region [ffff88807780e0c0, ffff88807780e140)
The buggy address belongs to the page:
page:ffffea0001de0380 count:1 mapcount:0 mapping:ffff88802dc03500 index:0x0
flags: 0x4fff80000000200(slab)
raw: 04fff80000000200 dead000000000100 dead000000000200 ffff88802dc03500
raw: 0000000000000000 0000000080150015 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88807780df80: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
 ffff88807780e000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88807780e080: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                          ^
 ffff88807780e100: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
 ffff88807780e180: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint