On Mon, Jan 07, 2019 at 05:16:05PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> > Hi Christoph,
> > 
> > I have a question about refcounting in struct iomap_dio:
> > 
> > As I mentioned late last year, I've been seeing a livelock since the
> > early 4.20-rcX era in iomap_dio_rw when running generic/323.  The
> > relevant part of the function starts around line 1910:
> > 
> > 	blk_finish_plug(&plug);
> > 
> > 	if (ret < 0)
> > 		iomap_dio_set_error(dio, ret);
> > 
> > 	/*
> > 	 * If all the writes we issued were FUA, we don't need to flush the
> > 	 * cache on IO completion. Clear the sync flag for this case.
> > 	 */
> > 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> > 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
> > 
> > 	if (!atomic_dec_and_test(&dio->ref)) {
> > 		int clang = 0;
> > 
> > 		if (!dio->wait_for_completion)
> > 			return -EIOCBQUEUED;
> > 
> > 		for (;;) {
> > 			set_current_state(TASK_UNINTERRUPTIBLE);
> > 			if (!READ_ONCE(dio->submit.waiter))
> > 				break;
> > 
> > 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> > 			    !dio->submit.last_queue ||
> > 			    !blk_poll(dio->submit.last_queue,
> > 					 dio->submit.cookie, true))
> > 				io_schedule(); <--------- here
> > 		}
> > 		__set_current_state(TASK_RUNNING);
> > 	}
> > 
> > 	ret = iomap_dio_complete(dio);
> > 
> > I finally had time to look into this, and I noticed garbage values for
> > dio->submit.waiter in *dio right before we call io_schedule.  I suspect
> > this is a use-after-free, so I altered iomap_dio_complete to memset the
> > entire structure to 0x5C just prior to kfreeing the *dio, and sure
> > enough I saw 0x5C in all the dio fields right before the io_schedule
> > call.
> > 
> > Next I instrumented all the places where we access dio->ref to see
> > what's going on, and observed the following sequence:
> > 
> > 1. ref == 2 just before the blk_finish_plug call.
> >    dio->wait_for_completion == false.
> > 2. ref == 2 just before the "if (!atomic_dec_and_test(...))"
> > 3. ref == 1 just after the "if (!atomic_dec_and_test(...))"
> > 
> > Next we apparently end up in iomap_dio_bio_end_io:
> > 
> > 4. ref == 1 just before the "if (atomic_dec_and_test(...))"
> > 5. ref == 0 just after the "if (atomic_dec_and_test(...))"
> > 6. iomap_dio_bio_end_io calls iomap_dio_complete, frees the dio after
> >    poisoning it with 0x5C as described above.
> > 
> > Then we jump back to iomap_dio_rw, and:
> > 
> > 7. ref == 0x5c5c5c5c just before the "if (!(iocb->ki_flags & IOCB_HIPRI)...)"
> >    However, dio->wait_for_completion is 0x5c5c5c5c so we incorrectly
> >    call io_schedule and never wake up.
> > 
> > KASAN confirms this diagnosis.  I noticed that only ~100us elapse
> > between the unplug and the bio endio function being called; I've found
> > that I can reproduce this in a qemu vm with a virtio-scsi device backed
> > by a tmpfs file on the host.  My guess is that with a slower scsi device
> > the dispatch would take long enough that iomap_dio_rw would have
> > returned -EIOCBQUEUED long before the bio end_io function gets called
> > and frees the dio?
> > 
> > Anyhow, I'm not sure how to fix this.  It's clear that iomap_dio_rw
> > can't drop its ref to the dio until it's done using the dio fields.
> > It's tempting to change the if (!atomic_dec_and_test()) to a if
> > (atomic_read() > 1) and only drop the ref just before returning
> > -EIOCBQUEUED or just before calling iomap_dio_complete... but that just
> > pushes the livelock to kill_ioctx.
> 
> No it doesn't, silly me just had debugging crapola all over iomap.c.
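To make the window easier to see, here's a toy userspace model of the same
pattern -- made-up names (toy_dio, toy_dio_put, completion_thread), C11
atomics plus pthreads, not the iomap code itself.  The "submitter" drops its
reference and only afterwards reads a field of the shared object; if the
"completion" thread drops the final reference in that gap, the read hits
freed memory, which is exactly what the 0x5C poisoning and KASAN showed
above.  (It won't reliably lose the race in practice -- thread creation is
slow -- but the ordering hazard is the same one.)

/*
 * Toy model of the hazard, not the iomap code: one "submitter" and one
 * "completion" thread share a refcounted object.  The submitter drops
 * its reference first and only then reads dio->wait_for_completion;
 * if the completion thread drops the last reference in that window,
 * the read is a use-after-free.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_dio {
	atomic_int ref;			/* starts at 2: submitter + in-flight bio */
	bool wait_for_completion;	/* read by the submitter *after* its dec */
};

/* drop one reference; the last one frees, like iomap_dio_complete() */
static void toy_dio_put(struct toy_dio *dio)
{
	if (atomic_fetch_sub(&dio->ref, 1) == 1)
		free(dio);
}

/* stands in for iomap_dio_bio_end_io() completing an async I/O */
static void *completion_thread(void *arg)
{
	toy_dio_put(arg);
	return NULL;
}

int main(void)
{
	struct toy_dio *dio = calloc(1, sizeof(*dio));
	pthread_t t;

	atomic_init(&dio->ref, 2);
	dio->wait_for_completion = false;
	pthread_create(&t, NULL, completion_thread, dio);

	/* stands in for the tail of iomap_dio_rw() */
	if (atomic_fetch_sub(&dio->ref, 1) != 1) {
		/*
		 * HAZARD WINDOW: completion_thread may drop the final
		 * reference and free *dio right here, so this read can
		 * hit freed memory -- the same pattern KASAN flagged.
		 */
		if (!dio->wait_for_completion)
			printf("would return -EIOCBQUEUED\n");
	} else {
		free(dio);	/* we held the last reference ourselves */
	}

	pthread_join(t, NULL);
	return 0;
}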
> 
> > Brain scrambled, kicking to the list to see if anyone has something
> > smarter to say. ;)
> 
> I have a gross brainscrambled patch that makes the regression go away.
> No idea what other subtle breakage this might cause...
> 
> --D
> 
> ---
>  fs/iomap.c |   22 ++++++++++++++--------
>  1 file changed, 14 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index a3088fae567b..cdea2a83ef78 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1535,16 +1535,18 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  {
>  	struct iomap_dio *dio = bio->bi_private;
>  	bool should_dirty = (dio->flags & IOMAP_DIO_DIRTY);
> +	int dio_ref = atomic_dec_return(&dio->ref);
>  
>  	if (bio->bi_status)
>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>  
> -	if (atomic_dec_and_test(&dio->ref)) {
> -		if (dio->wait_for_completion) {
> -			struct task_struct *waiter = dio->submit.waiter;
> -			WRITE_ONCE(dio->submit.waiter, NULL);
> -			blk_wake_io_task(waiter);
> -		} else if (dio->flags & IOMAP_DIO_WRITE) {
> +	if (dio_ref == 1 && dio->wait_for_completion) {
> +		struct task_struct *waiter = dio->submit.waiter;
> +
> +		WRITE_ONCE(dio->submit.waiter, NULL);
> +		blk_wake_io_task(waiter);
> +	} else if (dio_ref == 0) {
> +		if (dio->flags & IOMAP_DIO_WRITE) {

Heh, no, this doesn't actually fix the data race... :P

...I'm stopping for the night, maybe one of you have fresher eyes.

--D

>  			struct inode *inode = file_inode(dio->iocb->ki_filp);
>  
>  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> @@ -1916,9 +1918,12 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (dio->flags & IOMAP_DIO_WRITE_FUA)
>  		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>  
> -	if (!atomic_dec_and_test(&dio->ref)) {
> -		if (!dio->wait_for_completion)
> +	if (atomic_read(&dio->ref) > 1) {
> +		if (!dio->wait_for_completion) {
> +			if (atomic_dec_and_test(&dio->ref))
> +				return iomap_dio_complete(dio);
>  			return -EIOCBQUEUED;
> +		}
>  
>  		for (;;) {
>  			set_current_state(TASK_UNINTERRUPTIBLE);
> @@ -1934,6 +1939,7 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		__set_current_state(TASK_RUNNING);
>  	}
>  
> +	atomic_dec(&dio->ref);
>  	ret = iomap_dio_complete(dio);
>  
>  	return ret;
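FWIW, the only shape I can talk myself into for the async case is to pull
everything we need out of the dio *before* dropping the submission
reference, so nothing dereferences *dio after the dec unless we know the
completion side won't free it out from under us.  In terms of the toy model
further up (just a sketch of the ordering, not a proposed iomap patch;
submitter_finish and the -1 return standing in for -EIOCBQUEUED are made
up):

/* reuses struct toy_dio from the sketch above */
static int submitter_finish(struct toy_dio *dio)
{
	/* snapshot while our reference still pins *dio */
	bool wait_for_completion = dio->wait_for_completion;

	if (atomic_fetch_sub(&dio->ref, 1) != 1) {
		if (!wait_for_completion)
			return -1;	/* stands in for -EIOCBQUEUED */
		/*
		 * Synchronous case: this only stays safe if the
		 * completion side wakes the waiter instead of freeing
		 * when wait_for_completion is set; the waiter would
		 * sleep here until woken and then fall through.
		 */
	}

	free(dio);	/* by that contract we are the last user */
	return 0;
}

That keeps the submitter from ever looking at dio->wait_for_completion (or
dio->submit.waiter) past the point where the end_io path might already have
completed and freed the dio.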