On Mon, Jan 07, 2019 at 05:16:05PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> > Hi Christoph,
> > 
> > I have a question about refcounting in struct iomap_dio:
> > 
> > As I mentioned late last year, I've been seeing a livelock since the
> > early 4.20-rcX era in iomap_dio_rw when running generic/323.  The
> > relevant part of the function starts around line 1910:
> > 
> > 	blk_finish_plug(&plug);
> > 
> > 	if (ret < 0)
> > 		iomap_dio_set_error(dio, ret);
> > 
> > 	/*
> > 	 * If all the writes we issued were FUA, we don't need to flush the
> > 	 * cache on IO completion. Clear the sync flag for this case.
> > 	 */
> > 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> > 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
> > 
> > 	if (!atomic_dec_and_test(&dio->ref)) {
> > 		int clang = 0;
> > 
> > 		if (!dio->wait_for_completion)
> > 			return -EIOCBQUEUED;
> > 
> > 		for (;;) {
> > 			set_current_state(TASK_UNINTERRUPTIBLE);
> > 			if (!READ_ONCE(dio->submit.waiter))
> > 				break;
> > 
> > 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> > 			    !dio->submit.last_queue ||
> > 			    !blk_poll(dio->submit.last_queue,
> > 					 dio->submit.cookie, true))
> > 				io_schedule(); <--------- here
> > 		}
> > 		__set_current_state(TASK_RUNNING);
> > 	}
> > 
> > 	ret = iomap_dio_complete(dio);
> > 
> > I finally had time to look into this, and I noticed garbage values for
> > dio->submit.waiter in *dio right before we call io_schedule.  I suspect
> > this is a use-after-free, so I altered iomap_dio_complete to memset the
> > entire structure to 0x5C just prior to kfreeing the *dio, and sure
> > enough I saw 0x5C in all the dio fields right before the io_schedule
> > call.
> > 
> > Next I instrumented all the places where we access dio->ref to see
> > what's going on, and observed the following sequence:
> > 
> > 1. ref == 2 just before the blk_finish_plug call.
> >    dio->wait_for_completion == false.
> > 2. ref == 2 just before the "if (!atomic_dec_and_test(...))"
> > 3. ref == 1 just after the "if (!atomic_dec_and_test(...))"
> > 
> > Next we apparently end up in iomap_dio_bio_end_io:
> > 
> > 4. ref == 1 just before the "if (atomic_dec_and_test(...))"
> > 5. ref == 0 just after the "if (atomic_dec_and_test(...))"
> > 6. iomap_dio_bio_end_io calls iomap_dio_complete, frees the dio after
> >    poisoning it with 0x5C as described above.
> > 
> > Then we jump back to iomap_dio_rw, and:
> > 
> > 7. ref == 0x5c5c5c5c just before the "if (!(iocb->ki_flags & IOCB_HIPRI)...)"
> >    However, dio->wait_for_completion is 0x5c5c5c5c so we incorrectly
> >    call io_schedule and never wake up.
> > 
> > KASAN confirms this diagnosis.  I noticed that only ~100us elapse
> > between the unplug and the bio endio function being called; I've found
> > that I can reproduce this in a qemu vm with a virtio-scsi device backed
> > by a tmpfs file on the host.  My guess is that with a slower scsi device
> > the dispatch would take long enough that iomap_dio_rw would have
> > returned -EIOCBQUEUED long before the bio end_io function gets called
> > and frees the dio?
> > 
> > Anyhow, I'm not sure how to fix this.  It's clear that iomap_dio_rw
> > can't drop its ref to the dio until it's done using the dio fields.
> > It's tempting to change the if (!atomic_dec_and_test()) to a if
> > (atomic_read() > 1) and only drop the ref just before returning
> > -EIOCBQUEUED or just before calling iomap_dio_complete... but that just
> > pushes the livelock to kill_ioctx.
> 
> No it doesn't, silly me just had debugging crapola all over iomap.c.
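To make the window easier to see, here's a toy userspace model of the same
pattern -- made-up names (toy_dio, toy_dio_put, completion_thread), C11
atomics plus pthreads, not the iomap code itself.  The "submitter" drops its
reference and only afterwards reads a field of the shared object; if the
"completion" thread drops the final reference in that gap, the read hits
freed memory, which is exactly what the 0x5C poisoning and KASAN showed
above.  (It won't reliably lose the race in practice -- thread creation is
slow -- but the ordering hazard is the same one.)

/*
 * Toy model of the hazard, not the iomap code: one "submitter" and one
 * "completion" thread share a refcounted object.  The submitter drops
 * its reference first and only then reads dio->wait_for_completion;
 * if the completion thread drops the last reference in that window,
 * the read is a use-after-free.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_dio {
	atomic_int ref;			/* starts at 2: submitter + in-flight bio */
	bool wait_for_completion;	/* read by the submitter *after* its dec */
};

/* drop one reference; the last one frees, like iomap_dio_complete() */
static void toy_dio_put(struct toy_dio *dio)
{
	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		free(dio);
}

/* stands in for iomap_dio_bio_end_io() completing an async I/O */
static void *completion_thread(void *arg)
{
	toy_dio_put(arg);
	return NULL;
}

int main(void)
{
	struct toy_dio *dio = calloc(1, sizeof(*dio));
	pthread_t t;

	atomic_init(&dio->ref, 2);
	dio->wait_for_completion = false;
	pthread_create(&t, NULL, completion_thread, dio);

	/* stands in for the tail of iomap_dio_rw() */
	if (atomic_fetch_sub(&dio->ref, 1) != 1) {
		/*
		 * HAZARD WINDOW: completion_thread may drop the final
		 * reference and free *dio right here, so this read can
		 * hit freed memory -- the same pattern KASAN flagged.
		 */
		if (!dio->wait_for_completion)
			printf("would return -EIOCBQUEUED\n");
	} else {
		free(dio);	/* we held the last reference ourselves */
	}

	pthread_join(t, NULL);
	return 0;
}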
> 
> > Brain scrambled, kicking to the list to see if anyone has something
> > smarter to say. ;)
> 
> I have a gross brainscrambled patch that makes the regression go away.
> No idea what other subtle breakage this might cause...
> 
> --D
> 
> ---
>  fs/iomap.c |   22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index a3088fae567b..cdea2a83ef78 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1535,16 +1535,18 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
>  	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
> +	int dio_ref = atomic_dec_return(&dio->ref);
>  
>  	if (bio->bi_status)
>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>  
> -	if (atomic_dec_and_test(&dio->ref)) {
> -		if (dio->wait_for_completion) {
> -			struct task_struct *waiter = dio->submit.waiter;
> -			WRITE_ONCE(dio->submit.waiter, NULL);
> -			blk_wake_io_task(waiter);
> -		} else if (dio->flags & IOMAP_DIO_WRITE) {
> +	if (dio_ref == 1 && dio->wait_for_completion) {
> +		struct task_struct *waiter = dio->submit.waiter;
> +
> +		WRITE_ONCE(dio->submit.waiter, NULL);
> +		blk_wake_io_task(waiter);
> +	} else if (dio_ref == 0) {
> +		if (dio->flags & IOMAP_DIO_WRITE) {

Heh, no, this doesn't actually fix the data race... :P

...I'm stopping for the night, maybe one of you have fresher eyes.

--D

>  			struct inode *inode = file_inode(dio->iocb->ki_filp);
>  
>  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> @@ -1916,9 +1918,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (dio->flags & IOMAP_DIO_WRITE_FUA)
>  		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>  
> -	if (!atomic_dec_and_test(&dio->ref)) {
> -		if (!dio->wait_for_completion)
> +	if (atomic_read(&dio->ref) > 1) {
> +		if (!dio->wait_for_completion) {
> +			if (atomic_dec_and_test(&dio->ref))
> +				return iomap_dio_complete(dio);
>  			return -EIOCBQUEUED;
> +		}
>  
>  		for (;;) {
>  			set_current_state(TASK_UNINTERRUPTIBLE);
> @@ -1934,6 +1939,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		__set_current_state(TASK_RUNNING);
>  	}
>  
> +	atomic_dec(&dio->ref);
>  	ret = iomap_dio_complete(dio);
>  
>  	return ret;
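FWIW, the only shape I can talk myself into for the async case is to pull
everything we need out of the dio *before* dropping the submission
reference, so nothing dereferences *dio after the dec unless we know the
completion side won't free it out from under us.  In terms of the toy model
further up (just a sketch of the ordering, not a proposed iomap patch;
submitter_finish and the -1 return standing in for -EIOCBQUEUED are made
up):

/* reuses struct toy_dio from the sketch above */
static int submitter_finish(struct toy_dio *dio)
{
	/* snapshot while our reference still pins *dio */
	bool wait_for_completion = dio->wait_for_completion;

	if (atomic_fetch_sub(&dio->ref, 1) != 1) {
		if (!wait_for_completion)
			return -1;	/* stands in for -EIOCBQUEUED */
		/*
		 * Synchronous case: this only stays safe if the
		 * completion side wakes the waiter instead of freeing
		 * when wait_for_completion is set; the waiter would
		 * sleep here until woken and then fall through.
		 */
	}

	free(dio);	/* by that contract we are the last user */
	return 0;
}

That keeps the submitter from ever looking at dio->wait_for_completion (or
dio->submit.waiter) past the point where the end_io path might already have
completed and freed the dio.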