Re: Broken dio refcounting leads to livelock?

On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> Hi Christoph,
> 
> I have a question about refcounting in struct iomap_dio:
> 
> As I mentioned late last year, I've been seeing a livelock since the
> early 4.20-rcX era in iomap_dio_rw when running generic/323.  The
> relevant part of the function starts around line 1910:

Looking at iomap_dio_bio_end_io(), it has completion code that runs
only when dio->ref hits zero. i.e. it assumes that if it sees a zero
refcount, it holds the last reference and can free dio. Otherwise it
doesn't touch dio.
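That ownership rule can be sketched in userspace C (this is an illustration, not the kernel code; `struct dio_like` and the helper name are invented here): the context whose decrement takes the count to zero holds the last reference and owns the free, and everyone else must not touch the object after dropping their reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct dio_like {
	atomic_int ref;
};

/* Returns true iff the caller dropped the last reference. */
static bool put_ref(struct dio_like *d)
{
	/*
	 * atomic_fetch_sub() returns the previous value, so seeing 1
	 * means we just took the count to zero and now own the object.
	 */
	return atomic_fetch_sub(&d->ref, 1) == 1;
}
```

With the count initialised to 2 (submitter plus one in-flight bio), the submitter's put_ref() returns false and the bio completion's returns true, so only the completion side may free.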

> 
> 	blk_finish_plug(&plug);
> 
> 	if (ret < 0)
> 		iomap_dio_set_error(dio, ret);
> 
> 	/*
> 	 * If all the writes we issued were FUA, we don't need to flush the
> 	 * cache on IO completion. Clear the sync flag for this case.
> 	 */
> 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
> 
> 	if (!atomic_dec_and_test(&dio->ref)) {

So we've dropped the IO submission reference to the dio here, which
means the only remaining references are the bio references. We run
this code when there are outstanding bio references, which, given
the above, means the dio can be freed by IO completion at any time
after this. That means it's never safe to reference the dio after
this point.

Yup, that's a use after free.
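The safe shape is to snapshot any dio fields we need while we still hold our reference, since a racing completion may free the structure the instant the count is dropped. A minimal userspace sketch (names and types invented for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct dio_like {
	atomic_int ref;
	bool wait_for_completion;
};

/*
 * Broken shape (what the quoted code does):
 *
 *	if (!atomic_dec_and_test(&dio->ref))	// ref dropped...
 *		if (!dio->wait_for_completion)	// ...then dereferenced
 *			return -EIOCBQUEUED;	//    use-after-free
 *
 * Safe shape: read everything needed before dropping the reference.
 */
static bool submitter_put(struct dio_like *d, bool *was_last)
{
	bool wait = d->wait_for_completion;	/* read before the drop */

	*was_last = atomic_fetch_sub(&d->ref, 1) == 1;
	/* d must not be dereferenced past here unless *was_last is true */
	return wait;
}
```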

> 		if (!dio->wait_for_completion)
> 			return -EIOCBQUEUED;

Ok, dio->wait_for_completion is unchanging at this point, so it
doesn't matter if we do this check before or after we drop the
dio->ref. That simplifies things.

> 
> 		for (;;) {
> 			set_current_state(TASK_UNINTERRUPTIBLE);
> 			if (!READ_ONCE(dio->submit.waiter))
> 				break;
> 
> 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> 			    !dio->submit.last_queue ||
> 			    !blk_poll(dio->submit.last_queue,
> 					 dio->submit.cookie, true))
> 				io_schedule(); <--------- here
> 		}
> 		__set_current_state(TASK_RUNNING);

This is an IO wait, which means we do not want to free the dio
structure until after we've been woken. And that also means we
are going to be running iomap_dio_complete() below, so we /must/
hold on to the dio reference here. Which means we need to change
the iomap_dio_bio_end_io() code because it only triggers a wakeup
if it's dropping the last dio->ref.
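The fixed wakeup ordering for sync IO can be sketched like so (a userspace illustration under the assumption that the waiter always holds one reference, so the last in-flight bio sees a count of 2; names are invented here):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct dio_like {
	atomic_int ref;
	void *waiter;	/* non-NULL while the submitter is sleeping */
};

/*
 * Sync-IO completion sketch: if we are the last in-flight bio
 * (ref == 2: this bio plus the waiter's reference), clear the waiter
 * and issue the wakeup *before* dropping our reference, so the woken
 * task still holds a valid dio when it resumes. Returns true if a
 * wakeup was issued.
 */
static bool end_io_sync(struct dio_like *d)
{
	bool woke = false;

	if (atomic_load(&d->ref) == 2) {
		d->waiter = NULL;	/* WRITE_ONCE() + wakeup for real */
		woke = true;
	}
	atomic_fetch_sub(&d->ref, 1);
	return woke;
}
```

Note the ordering: clear waiter, wake, then decrement, matching the checks the wait loop makes so the waiter cannot race past the wakeup and free the dio underneath us.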

OK, so something like the patch below?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

iomap: fix iomap dio reference counts

From: Dave Chinner <dchinner@xxxxxxxxxx>

The iomap dio structure could be referenced after other contexts had
dropped their reference to it, resulting in use-after-free
conditions. Rework the reference counting to ensure that we never
access the dio structure without holding a valid reference, or
without having guaranteed that the context holding the last
reference will free it.

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/iomap.c | 75 +++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 55 insertions(+), 20 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index a3088fae567b..ee71e2ec93d4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -1539,12 +1539,31 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 	if (bio->bi_status)
 		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
 
-	if (atomic_dec_and_test(&dio->ref)) {
-		if (dio->wait_for_completion) {
+	/*
+	 * If we have a waiter, then this is a sync IO and the waiter will still
+	 * hold a reference to the dio. If this is the last IO reference, we
+	 * cannot reference the dio after we drop the reference as the waiter
+	 * may be woken immediately and free the dio before we are done with it.
+	 * Hence we check for two references, do the wakeup, then drop the final
+	 * IO reference. The ordering of clearing the submit.waiter, waking the
+	 * waiter and then dropping the reference matches the order of checks in
+	 * the wait loop to avoid wakeup races.
+	 */
+	if (dio->wait_for_completion) {
+		if (atomic_read(&dio->ref) == 2) {
 			struct task_struct *waiter = dio->submit.waiter;
 			WRITE_ONCE(dio->submit.waiter, NULL);
 			blk_wake_io_task(waiter);
-		} else if (dio->flags & IOMAP_DIO_WRITE) {
+		}
+		atomic_dec(&dio->ref);
+	} else if (atomic_dec_and_test(&dio->ref)) {
+		/*
+		 * There is no waiter so we are finishing async IO. If the
+		 * refcount drops to zero then we are responsible for running
+		 * the appropriate IO completion to finish and free the dio
+		 * context.
+		 */
+		if (dio->flags & IOMAP_DIO_WRITE) {
 			struct inode *inode = file_inode(dio->iocb->ki_filp);
 
 			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
@@ -1553,6 +1572,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
 			iomap_dio_complete_work(&dio->aio.work);
 		}
 	}
+	/* not safe to use dio past this point */
 
 	if (should_dirty) {
 		bio_check_pages_dirty(bio);
@@ -1916,27 +1936,42 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
 
-	if (!atomic_dec_and_test(&dio->ref)) {
-		if (!dio->wait_for_completion)
-			return -EIOCBQUEUED;
+	/*
+	 * If this is async IO, and we drop the last reference here we need
+	 * to complete the IO, otherwise we return EIOCBQUEUED. The order of
+	 * checks is important here because we can be racing with IO completion
+	 * to drop the last reference and free the dio, so we must check the dio
+	 * state before we drop the reference to avoid use-after-free
+	 * situations.
+	 */
+	if (!dio->wait_for_completion) {
+		if (atomic_dec_and_test(&dio->ref))
+			return iomap_dio_complete(dio);
+		return -EIOCBQUEUED;
+	}
 
-		for (;;) {
-			set_current_state(TASK_UNINTERRUPTIBLE);
-			if (!READ_ONCE(dio->submit.waiter))
-				break;
+	/*
+	 * This is sync IO and we have to access the dio structure to determine
+	 * if we need to wait and/or poll the block device for completion. Hence
+	 * we need to hold on to our reference until IO completion has dropped
+	 * all the IO references and woken us before we drop our reference and
+	 * complete the IO.
+	 */
+	while (atomic_read(&dio->ref) > 1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (!READ_ONCE(dio->submit.waiter))
+			break;
 
-			if (!(iocb->ki_flags & IOCB_HIPRI) ||
-			    !dio->submit.last_queue ||
-			    !blk_poll(dio->submit.last_queue,
-					 dio->submit.cookie, true))
-				io_schedule();
-		}
-		__set_current_state(TASK_RUNNING);
+		if (!(iocb->ki_flags & IOCB_HIPRI) ||
+		    !dio->submit.last_queue ||
+		    !blk_poll(dio->submit.last_queue,
+				 dio->submit.cookie, true))
+			io_schedule();
 	}
+	__set_current_state(TASK_RUNNING);
+	atomic_dec(&dio->ref);
 
-	ret = iomap_dio_complete(dio);
-
-	return ret;
+	return iomap_dio_complete(dio);
 
 out_free_dio:
 	kfree(dio);


