On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> Hi Christoph,
>
> I have a question about refcounting in struct iomap_dio:
>
> As I mentioned late last year, I've been seeing a livelock since the
> early 4.20-rcX era in iomap_dio_rw when running generic/323. The
> relevant part of the function starts around line 1910:
>
> 	blk_finish_plug(&plug);
>
> 	if (ret < 0)
> 		iomap_dio_set_error(dio, ret);
>
> 	/*
> 	 * If all the writes we issued were FUA, we don't need to flush the
> 	 * cache on IO completion. Clear the sync flag for this case.
> 	 */
> 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>
> 	if (!atomic_dec_and_test(&dio->ref)) {
> 		if (!dio->wait_for_completion)
> 			return -EIOCBQUEUED;
>
> 		for (;;) {
> 			set_current_state(TASK_UNINTERRUPTIBLE);
> 			if (!READ_ONCE(dio->submit.waiter))
> 				break;
>
> 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> 			    !dio->submit.last_queue ||
> 			    !blk_poll(dio->submit.last_queue,
> 					 dio->submit.cookie, true))
> 				io_schedule(); <--------- here
> 		}
> 		__set_current_state(TASK_RUNNING);
> 	}
>
> 	ret = iomap_dio_complete(dio);
>
> I finally had time to look into this, and I noticed garbage values for
> dio->submit.waiter in *dio right before we call io_schedule. I suspect
> this is a use-after-free, so I altered iomap_dio_complete to memset the
> entire structure to 0x5C just prior to kfreeing the *dio, and sure
> enough I saw 0x5C in all the dio fields right before the io_schedule
> call.
>
> Next I instrumented all the places where we access dio->ref to see
> what's going on, and observed the following sequence:
>
> 1. ref == 2 just before the blk_finish_plug call.
>    dio->wait_for_completion == false.
> 2. ref == 2 just before the "if (!atomic_dec_and_test(...))".
> 3. ref == 1 just after the "if (!atomic_dec_and_test(...))".
>
> Next we apparently end up in iomap_dio_bio_end_io:
>
> 4. ref == 1 just before the "if (atomic_dec_and_test(...))".
> 5. ref == 0 just after the "if (atomic_dec_and_test(...))".
> 6. iomap_dio_bio_end_io calls iomap_dio_complete, which frees the dio
>    after poisoning it with 0x5C as described above.
>
> Then we jump back to iomap_dio_rw, and:
>
> 7. ref == 0x5c5c5c5c just before the "if (!(iocb->ki_flags &
>    IOCB_HIPRI)...)". However, dio->wait_for_completion is also
>    0x5c5c5c5c, so we incorrectly enter the wait loop, call io_schedule,
>    and never wake up.
>
> KASAN confirms this diagnosis. I noticed that only ~100us elapse
> between the unplug and the bio endio function being called; I've found
> that I can reproduce this in a qemu vm with a virtio-scsi device backed
> by a tmpfs file on the host. My guess is that with a slower scsi device
> the dispatch would take long enough that iomap_dio_rw would have
> returned -EIOCBQUEUED long before the bio end_io function gets called
> and frees the dio?
>
> Anyhow, I'm not sure how to fix this. It's clear that iomap_dio_rw
> can't drop its ref to the dio until it's done using the dio fields.
> It's tempting to change the if (!atomic_dec_and_test()) to an
> if (atomic_read() > 1) and only drop the ref just before returning
> -EIOCBQUEUED or just before calling iomap_dio_complete... but that just
> pushes the livelock to kill_ioctx.

No it doesn't; silly me just had debugging crapola all over iomap.c.

> Brain scrambled, kicking to the list to see if anyone has something
> smarter to say. ;)

I have a gross brainscrambled patch that makes the regression go away.
No idea what other subtle breakage this might cause...
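
For reference, the 0x5C poisoning mentioned above was roughly this,
stuffed in at the tail of iomap_dio_complete just before we free the
dio (a debugging hack only, not part of the patch below):

	/* debug: poison the dio so any stale reader sees 0x5c */
	memset(dio, 0x5c, sizeof(*dio));
	kfree(dio);
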
--D

---
 fs/iomap.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..cdea2a83ef78 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1535,16 +1535,18 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 {
 	struct iomap_dio *dio = bio->bi_private;
 	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
+	int dio_ref = atomic_dec_return(&dio->ref);
 
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
-			struct task_struct *waiter = dio->submit.waiter;
-			WRITE_ONCE(dio->submit.waiter, NULL);
-			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
+	if (dio_ref == 1 && dio->wait_for_completion) {
+		struct task_struct *waiter = dio->submit.waiter;
+
+		WRITE_ONCE(dio->submit.waiter, NULL);
+		blk_wake_io_task(waiter);
+	} else if (dio_ref == 0) {
+		if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);
 
 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
@@ -1916,9 +1918,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
-	if (!atomic_dec_and_test(&dio->ref)) {
-		if (!dio->wait_for_completion)
+	if (atomic_read(&dio->ref) > 1) {
+		if (!dio->wait_for_completion) {
+			if (atomic_dec_and_test(&dio->ref))
+				return iomap_dio_complete(dio);
 			return -EIOCBQUEUED;
+		}
 
 		for (;;) {
 			set_current_state(TASK_UNINTERRUPTIBLE);
@@ -1934,6 +1939,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		__set_current_state(TASK_RUNNING);
 	}
 
+	atomic_dec(&dio->ref);
 	ret = iomap_dio_complete(dio);
 
 	return ret;
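
The idea is that iomap_dio_rw() hangs on to its own reference until it
is completely done looking at *dio, so the bio end_io side can only
observe "ref == 1 and the submitter is still waiting" (wake it up) or
"ref == 0" (async case, end_io owns the dio), and can never free the
dio out from under the waiter loop.

To convince myself the handoff is sane I hacked up a userspace model of
it with C11 atomics and pthreads -- a sketch only, all the names are
made up and it is obviously not kernel code:

/* refmodel.c: toy model of the dio refcount handoff.
 * Submitter holds one ref, each in-flight "bio" holds one.
 * Build: cc -std=c11 -O2 -pthread refmodel.c -o refmodel
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct dio {
	atomic_int	ref;	/* 1 for the submitter + 1 per bio */
	atomic_bool	waiter;	/* submitter is waiting for the bios */
};

/* models iomap_dio_bio_end_io() with the patch applied */
static void *end_io(void *arg)
{
	struct dio *dio = arg;
	int ref = atomic_fetch_sub(&dio->ref, 1) - 1;	/* new value */

	if (ref == 1 && atomic_load(&dio->waiter))
		atomic_store(&dio->waiter, false);	/* wake submitter */
	else if (ref == 0)
		free(dio);	/* async case: end_io owns the dio */
	return NULL;
}

int main(void)
{
	struct dio *dio = malloc(sizeof(*dio));
	pthread_t t;

	atomic_init(&dio->ref, 2);	/* submitter + one bio */
	atomic_init(&dio->waiter, true);
	pthread_create(&t, NULL, end_io, dio);

	/*
	 * Submitter: wait for the bio first, drop our ref last.
	 * Dropping the ref before this loop is exactly the
	 * use-after-free described above.
	 */
	while (atomic_load(&dio->waiter))
		;	/* io_schedule() stand-in */
	pthread_join(t, NULL);

	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		free(dio);	/* we held the last ref */
	printf("done\n");
	return 0;
}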