On Tue, Jan 08, 2019 at 06:46:44PM +1100, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 04:26:33PM -0800, Darrick J. Wong wrote:
> > Hi Christoph,
> > 
> > I have a question about refcounting in struct iomap_dio:
> > 
> > As I mentioned late last year, I've been seeing a livelock since the
> > early 4.20-rcX era in iomap_dio_rw when running generic/323.  The
> > relevant part of the function starts around line 1910:
> 
> Looking at iomap_dio_bio_end_io(), it has completion code that runs
> only when dio->ref hits zero. i.e. it assumes that if it sees a zero
> refcount, it holds the last reference and can free dio. Otherwise it
> doesn't touch dio.
> 
> > 	blk_finish_plug(&plug);
> > 
> > 	if (ret < 0)
> > 		iomap_dio_set_error(dio, ret);
> > 
> > 	/*
> > 	 * If all the writes we issued were FUA, we don't need to flush the
> > 	 * cache on IO completion. Clear the sync flag for this case.
> > 	 */
> > 	if (dio->flags & IOMAP_DIO_WRITE_FUA)
> > 		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
> > 
> > 	if (!atomic_dec_and_test(&dio->ref)) {
> 
> So we've dropped the IO submission reference to the dio here, which
> means the only remaining references are the bio references. We run
> this code if there are remaining bio references, which given the
> above means that dio can be freed at any time after this by IO
> completion. That means it's never safe to reference dio after this.
> 
> Yup, that's a use after free.
> 
> > 		if (!dio->wait_for_completion)
> > 			return -EIOCBQUEUED;
> 
> Ok, dio->wait_for_completion is unchanging at this point, so it
> doesn't matter if we do this check before or after we drop the
> dio->ref. That simplifies things.
> 
> > 		for (;;) {
> > 			set_current_state(TASK_UNINTERRUPTIBLE);
> > 			if (!READ_ONCE(dio->submit.waiter))
> > 				break;
> > 
> > 			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> > 			    !dio->submit.last_queue ||
> > 			    !blk_poll(dio->submit.last_queue,
> > 					 dio->submit.cookie, true))
> > 				io_schedule();   <--------- here
> > 		}
> > 		__set_current_state(TASK_RUNNING);
> 
> This is an IO wait, which means we do not want to free the dio
> structure until after we've been woken. And that also means we
> are going to be running iomap_dio_complete() below, so we /must/
> hold on to the dio reference here. Which means we need to change
> the iomap_dio_bio_end_io() code, because it only triggers a wakeup
> if it's dropping the last dio->ref.
> 
> OK, so something like the patch below?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 
> iomap: fix iomap dio reference counts
> 
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> The iomap dio structure could be referenced after contexts had
> dropped their reference, resulting in use-after-free conditions
> being created. Rework the reference counting to ensure that we never
> access the dio structure without holding a valid reference count or
> having guaranteed that the context holding the last reference will
> free it.
> 
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/iomap.c | 75 +++++++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 55 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index a3088fae567b..ee71e2ec93d4 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1539,12 +1539,31 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  	if (bio->bi_status)
>  		iomap_dio_set_error(dio, blk_status_to_errno(bio->bi_status));
>  
> -	if (atomic_dec_and_test(&dio->ref)) {
> -		if (dio->wait_for_completion) {
> +	/*
> +	 * If we have a waiter, then this is a sync IO and the waiter will still
> +	 * hold a reference to the dio.
 If this is the last IO reference, we
> +	 * cannot reference the dio after we drop the reference, as the waiter
> +	 * may be woken immediately and free the dio before we are done with it.
> +	 * Hence we check for two references, do the wakeup, then drop the final
> +	 * IO reference. The ordering of clearing submit.waiter, waking the
> +	 * waiter and then dropping the reference matches the order of checks in
> +	 * the wait loop to avoid wakeup races.
> +	 */
> +	if (dio->wait_for_completion) {
> +		if (atomic_read(&dio->ref) == 2) {
>  			struct task_struct *waiter = dio->submit.waiter;
>  			WRITE_ONCE(dio->submit.waiter, NULL);
>  			blk_wake_io_task(waiter);

This blows up because waiter == NULL, which implies that this is the
second time through this code block for a given dio.  That doesn't
sound right... (more after the traceback)

BUG: unable to handle kernel NULL pointer dereference at 0000000000000800
#PF error: [normal kernel read fault]
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 1 PID: 3438 Comm: fsstress Not tainted 5.0.0-rc1-xfsx #rc1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:__lock_acquire+0x40b/0x15f0
Code: d2 48 8b bc 24 98 00 00 00 65 48 33 3c 25 28 00 00 00 44 89 d0 0f 85 83 0f 00 00 48 8d 65 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <49> 81 3b a0 63 70 82 41 b8 00 00 00 00 44 0f 45 c0 83 fe 01 0f 87
RSP: 0018:ffff88803ea03c80 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88803ea03d50 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000800 R12: ffff88803cd18000
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
FS:  00007f508846cb80(0000) GS:ffff88803ea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000800 CR3: 000000002f9fa001 CR4: 00000000001606a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 ? __lock_is_held+0x51/0x90
 ? __lock_is_held+0x51/0x90
 ? kvm_clock_read+0x14/0x30
 ? __lock_acquire+0x26c/0x15f0
 ? kvm_clock_read+0x14/0x30
 lock_acquire+0x90/0x180
 ? try_to_wake_up+0x3f/0x560
 ? ring_buffer_unlock_commit+0x21/0x120
 _raw_spin_lock_irqsave+0x3e/0x80
 ? try_to_wake_up+0x3f/0x560
 try_to_wake_up+0x3f/0x560
 ? __trace_bprintk+0x6e/0x80
 iomap_dio_bio_end_io+0x1a6/0x1d0
 blk_update_request+0xd7/0x2f0
 scsi_end_request+0x32/0x3a0
 scsi_io_completion+0x79/0x610
 blk_done_softirq+0xb1/0xe0
 __do_softirq+0xe5/0x4b2
 ? kvm_sched_clock_read+0x5/0x10
 irq_exit+0xbc/0xe0
 call_function_single_interrupt+0xf/0x20
 </IRQ>
RIP: 0010:__list_del_entry_valid+0x3d/0x4c
Code: c8 0f 84 5a 00 00 00 48 b9 00 02 00 00 00 00 ad de 48 39 ca 0f 84 7d 00 00 00 48 8b 32 48 39 fe 0f 85 5d 00 00 00 48 8b 50 08 <48> 39 f2 0f 85 42 00 00 00 b8 01 00 00 00 c3 48 89 d1 48 c7 c7 e8
RSP: 0018:ffffc90005f2f9f0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
RAX: ffff88802f9da110 RBX: ffff888078313a00 RCX: dead000000000200
RDX: ffff88803dbd3eb0 RSI: ffff88803dbd3eb0 RDI: ffff88803dbd3eb0
RBP: ffff8880370c7800 R08: 0000000000003bac R09: 000000000000000c
R10: 0000000000000000 R11: ffff88803c586c00 R12: ffff888078942c00
R13: ffff88803dbd3eb0 R14: ffff88803c586c00 R15: ffff88803dbd3e40
 xfs_log_commit_cil+0x554/0x880 [xfs]
 ? xfs_trans_roll+0x7e/0x190 [xfs]
 __xfs_trans_commit+0xd3/0x630 [xfs]
 ? xfs_defer_trans_roll+0x16e/0x5b0 [xfs]
 ? xfs_defer_finish_noroll+0xf1/0x7e0 [xfs]
 xfs_trans_roll+0x7e/0x190 [xfs]
 xfs_defer_trans_roll+0x16e/0x5b0 [xfs]
 ? __xfs_trans_commit+0x1c3/0x630 [xfs]
 xfs_defer_finish_noroll+0xf1/0x7e0 [xfs]
 ? xfs_iomap_write_unwritten+0xf4/0x390 [xfs]
 __xfs_trans_commit+0x1c3/0x630 [xfs]
 xfs_iomap_write_unwritten+0xf4/0x390 [xfs]
 iomap_dio_complete+0x3c/0x130
 ? iomap_dio_rw+0x404/0x520
 iomap_dio_rw+0x3e7/0x520
 ?
 xfs_file_dio_aio_write+0x132/0x3d0 [xfs]
 xfs_file_dio_aio_write+0x132/0x3d0 [xfs]
 xfs_file_write_iter+0xf2/0x1d0 [xfs]
 __vfs_write+0x1a1/0x200
 vfs_write+0xba/0x1c0
 ksys_write+0x52/0xc0
 do_syscall_64+0x4a/0x150
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f5087c3c281
Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
RSP: 002b:00007fff0879c828 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f5087c3c281
RDX: 0000000000019000 RSI: 00005585ac58a200 RDI: 0000000000000003
RBP: 00000000001dd000 R08: 0000000000000007 R09: 0000000000000000
R10: 00005585ac54a010 R11: 0000000000000246 R12: 0000000000000254
R13: 0000000000019000 R14: 00005585ac58a200 R15: 0000000000000000
Modules linked in: xfs libcrc32c bfq dax_pmem device_dax nd_pmem sch_fq_codel ip_tables x_tables nfsv4 af_packet

Oh sh*t, we're committing the unwritten extent conversion before the
disk write even finishes!

I added some tracepoints to see what's going on with the dio here:

The iomap_dio_bio_end_io event is right at the top of the function.
dio0 is right after we set dio->submit.waiter = current
dio1 is right before we call blk_finish_plug
dio2 is right before we check !dio->wait_for_completion
dio3 is just before the final iomap_dio_complete

fsstress-3438   1.... 5060291us : iomap_dio_rw: dio0 xffff88802a88cc00 ref 1 wfc 1 waiter xffff88803cd18000
fsstress-3437   1..s1 5061648us : iomap_dio_bio_end_io: dio xffff88802a88cc00 ref 2 wfc 1 waiter xffff88803cd18000
fsstress-3438   1.... 5062366us : iomap_dio_rw: dio1 xffff88802a88cc00 ref 3 wfc 1 waiter x0
fsstress-3438   1.... 5062388us : iomap_dio_rw: dio2 xffff88802a88cc00 ref 3 wfc 1 waiter x0
fsstress-3438   1.... 5062388us : iomap_dio_rw: dio3 xffff88802a88cc00 ref 3 wfc 1 waiter x0
fsstress-3438   1..s1 5062494us : iomap_dio_bio_end_io: dio xffff88802a88cc00 ref 2 wfc 1 waiter x0

It looks like we're issuing two bios to satisfy a particular dio.
However, the first bio completes before the submitter calls
finish_plug??  I guess this means that blk_start_plug no longer plugs
up IO requests, which means that the end_io function tries to wake up
the submitter before it's even had a chance to issue the second bio.

This is surprising to me, because I was under the impression that
blk_{start,finish}_plug held all the bios, so there was no chance that
any of them would issue (let alone call end_io) before finish_plug.

/sys/block/sda/queue/scheduler = [bfq] none
/sys/block/sdf/queue/scheduler = [bfq] none

Changing the scheduler to none doesn't help.  However, elevating
&dio->ref for the duration of the plug does seem to work around this
problem...

--D

> -	} else if (dio->flags & IOMAP_DIO_WRITE) {
> +		}
> +		atomic_dec(&dio->ref);
> +	} else if (atomic_dec_and_test(&dio->ref)) {
> +		/*
> +		 * There is no waiter, so we are finishing async IO. If the
> +		 * refcount drops to zero then we are responsible for running
> +		 * the appropriate IO completion to finish and free the dio
> +		 * context.
> +		 */
> +		if (dio->flags & IOMAP_DIO_WRITE) {
>  			struct inode *inode = file_inode(dio->iocb->ki_filp);
>  
>  			INIT_WORK(&dio->aio.work, iomap_dio_complete_work);
> @@ -1553,6 +1572,7 @@ static void iomap_dio_bio_end_io(struct bio *bio)
>  			iomap_dio_complete_work(&dio->aio.work);
>  		}
>  	}
> +	/* not safe to use dio past this point */
>  
>  	if (should_dirty) {
>  		bio_check_pages_dirty(bio);
> @@ -1916,27 +1936,42 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (dio->flags & IOMAP_DIO_WRITE_FUA)
>  		dio->flags &= ~IOMAP_DIO_NEED_SYNC;
>  
> -	if (!atomic_dec_and_test(&dio->ref)) {
> -		if (!dio->wait_for_completion)
> -			return -EIOCBQUEUED;
> +	/*
> +	 * If this is async IO, and we drop the last reference here, we need
> +	 * to complete the IO; otherwise we return EIOCBQUEUED. The order of
> +	 * checks is important here because we can be racing with IO completion
> +	 * to drop the last reference and free the dio, so we must check the dio
> +	 * state before we drop the reference to avoid use-after-free
> +	 * situations.
> +	 */
> +	if (!dio->wait_for_completion) {
> +		if (atomic_dec_and_test(&dio->ref))
> +			return iomap_dio_complete(dio);
> +		return -EIOCBQUEUED;
> +	}
>  
> -		for (;;) {
> -			set_current_state(TASK_UNINTERRUPTIBLE);
> -			if (!READ_ONCE(dio->submit.waiter))
> -				break;
> +	/*
> +	 * This is sync IO and we have to access the dio structure to determine
> +	 * whether we need to wait and/or poll the block device for completion.
> +	 * Hence we need to hold on to our reference until IO completion has
> +	 * dropped all the IO references and woken us, before we drop our
> +	 * reference and complete the IO.
> +	 */
> +	while (atomic_read(&dio->ref) > 1) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		if (!READ_ONCE(dio->submit.waiter))
> +			break;
>  
> -			if (!(iocb->ki_flags & IOCB_HIPRI) ||
> -			    !dio->submit.last_queue ||
> -			    !blk_poll(dio->submit.last_queue,
> -					 dio->submit.cookie, true))
> -				io_schedule();
> -		}
> -		__set_current_state(TASK_RUNNING);
> +		if (!(iocb->ki_flags & IOCB_HIPRI) ||
> +		    !dio->submit.last_queue ||
> +		    !blk_poll(dio->submit.last_queue,
> +				 dio->submit.cookie, true))
> +			io_schedule();
>  	}
> +	__set_current_state(TASK_RUNNING);
> +	atomic_dec(&dio->ref);
>  
> -	ret = iomap_dio_complete(dio);
> -
> -	return ret;
> +	return iomap_dio_complete(dio);
>  
>  out_free_dio:
>  	kfree(dio);