On Tue, Jul 21, 2015 at 12:05:56AM -0400, Ming Lei wrote:
> On Mon, Jul 20, 2015 at 9:59 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > Hi Ming,
> >
> > With the recent merge of the loop device changes, I'm now seeing
> > XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
> >
> > The deadlock is as follows:
> >
> > kloopd1: loop_queue_read_work
> >     xfs_file_read_iter
> >     lock XFS inode XFS_IOLOCK_SHARED (on image file)
> >     page cache read (GFP_KERNEL)
> >     radix tree alloc
> >     memory reclaim
> >     reclaim XFS inodes
> >     log force to unpin inodes
> >     <wait for log IO completion>
> >
> > xfs-cil/loop1: <does log force IO work>
> >     xlog_cil_push
> >     xlog_write
> >     <loop issuing log writes>
> >         xlog_state_get_iclog_space()
> >         <blocks due to all log buffers under write io>
> >         <waits for IO completion>
> >
> > kloopd1: loop_queue_write_work
> >     xfs_file_write_iter
> >     lock XFS inode XFS_IOLOCK_EXCL (on image file)
> >     <wait for inode to be unlocked>
> >
> > [The full stack traces are below].
> >
> > i.e. kloopd, with its split read and write work queues, has
> > introduced a dependency through memory reclaim: writes need to be
> > able to make progress before reads can make progress.
>
> This kind of change just means READ and WRITE requests are submitted
> to the fs concurrently; shouldn't the use case have been reproducible
> from user space against one regular XFS file too?

Assuming the "regular XFS file" is on a normal block device (i.e. not
a loop device), then this will not deadlock, as there is no dependency
on VFS-level locking for log writes. i.e. the normal userspace IO path
is:

userspace read
  vfs_read
    xfs_read
      page cache alloc (GFP_KERNEL)
        direct reclaim
          xfs_inode reclaim
            log force
              CIL push
                <workqueue>
                xlog_write
                submit_bio -> hardware

And then the log IO completes, and everything continues onward.

What the loop device used to do:

userspace read
  vfs_read
    xfs_read
      page cache alloc (GFP_KERNEL)
        submit_bio
          loop device
            splice_read (on image file)
              xfs_splice_read
                page cache alloc (GFP_NOFS)
                  direct reclaim
                    <skip filesystem reclaim>
                submit_bio -> hardware

And when the read IO completes, everything moves onwards.

What the loop device now does:

userspace read
  vfs_read
    xfs_read
      page cache alloc (GFP_KERNEL)
        submit_bio
          loop device
            <workqueue>
            vfs_read (on image file)
              xfs_read
                page cache alloc (GFP_KERNEL)
                  direct reclaim
                    xfs_inode reclaim
                      log force
                        CIL push
                          <workqueue>
                          xlog_write
                          submit_bio
                            loop device
                              <workqueue>
                              vfs_write (on image file)
                                xfs_write
                                  <deadlock on image file lock>

> > The problem, fundamentally, is that mpage_readpages() does a
> > GFP_KERNEL allocation, rather than paying attention to the inode's
> > mapping gfp mask, which is set to GFP_NOFS.
>
> That looks like the root cause, and I guess the issue is just
> triggered after commit aa4d86163e4 ("block: loop: switch to VFS
> ITER_BVEC"), which changed splice over to the bvec iterator.

Yup - you are the unfortunate person who has wandered into the
minefield I'd been telling people about for quite some time. :(

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
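To make the gfp mask mechanics above concrete: the loop driver masks
__GFP_IO and __GFP_FS off the backing file's mapping when a file is
bound to the device, and that is where the GFP_NOFS allocation in the
old splice path came from. Abridged from loop_set_fd() in
drivers/block/loop.c (the helper name below is invented for this
excerpt; in the real driver these lines sit inline in loop_set_fd(),
and loop_clr_fd() restores the saved mask on unbind):

	#include <linux/fs.h>		/* struct file, struct address_space */
	#include <linux/pagemap.h>	/* mapping_gfp_mask(), mapping_set_gfp_mask() */
	#include "loop.h"		/* struct loop_device, ->old_gfp_mask */

	/*
	 * Clear __GFP_IO/__GFP_FS on the backing file's mapping so that
	 * page cache allocations against the image file cannot recurse
	 * into IO or filesystem reclaim. The old mask is saved so it
	 * can be restored when the loop device is unbound.
	 */
	static void loop_mask_backing_gfp(struct loop_device *lo,
					  struct file *file)
	{
		struct address_space *mapping = file->f_mapping;

		lo->old_gfp_mask = mapping_gfp_mask(mapping);
		mapping_set_gfp_mask(mapping,
				lo->old_gfp_mask & ~(__GFP_IO | __GFP_FS));
	}

The deadlock arises because mpage_readpages() then ignores that mask
and allocates with GFP_KERNEL anyway.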
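And the shape of fix the root cause paragraph points at: have
mpage_readpages() constrain its page cache allocation by the mapping's
gfp mask instead of hardcoding GFP_KERNEL. A sketch against the
fs/mpage.c of that era, untested and not necessarily the patch that
eventually lands upstream:

	/*
	 * fs/mpage.c (sketch): derive the readahead allocation mask
	 * from the mapping, so that a GFP_NOFS mapping (e.g. a
	 * loop-bound image file) never enters filesystem reclaim here.
	 */
	int
	mpage_readpages(struct address_space *mapping, struct list_head *pages,
			unsigned nr_pages, get_block_t get_block)
	{
		struct bio *bio = NULL;
		unsigned page_idx;
		sector_t last_block_in_bio = 0;
		struct buffer_head map_bh;
		unsigned long first_logical_block = 0;
		/* the fix: honour the per-mapping mask; was GFP_KERNEL */
		gfp_t gfp = mapping_gfp_mask(mapping) & GFP_KERNEL;

		map_bh.b_state = 0;
		map_bh.b_size = 0;
		for (page_idx = 0; page_idx < nr_pages; page_idx++) {
			struct page *page = list_entry(pages->prev,
						       struct page, lru);

			prefetchw(&page->flags);
			list_del(&page->lru);
			if (!add_to_page_cache_lru(page, mapping,
						   page->index, gfp)) {
				bio = do_mpage_readpage(bio, page,
						nr_pages - page_idx,
						&last_block_in_bio, &map_bh,
						&first_logical_block,
						get_block);
			}
			page_cache_release(page);
		}
		BUG_ON(!list_empty(pages));
		if (bio)
			mpage_bio_submit(READ, bio);
		return 0;
	}

With the mask ANDed down like this, a GFP_NOFS mapping keeps the loop
read path out of XFS inode reclaim, so the reads-blocked-behind-writes
inversion described above cannot form.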