Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> · Wed, 25 Jul 2018 16:28:39 -0600

On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >    that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >    (Dave, Darrick and Jan)
> 
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
provide synchronize us with the direct I/O path.  This exact same issue exists
in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch.  This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0					CPU 1
-----					-----
ext4_punch_hole()
  ext4_break_layouts() # all pages have refcount=1
  					
					ext4_direct_IO()
					  ... lots of layers ...
					  follow_page_pte()
					    get_page() # elevates refcount
  
  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0					CPU 1
-----					-----
					ext4_direct_IO()
					  ... lots of layers ...
					  follow_page_pte()
					    get_page()
					    # elevates refcount of page X
ext4_punch_hole()
  ext4_break_layouts() # two pages, X & Y, have refcount == 2
    __wait_var_event() # called for page X
  					
					  __put_devmap_managed_page()
					  # drops refcount of X to 1

   # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
   # We never actually called ext4_wait_dax_page(), so 'retry' in
   # ext4_break_layouts() is still false.  Exit do/while loop in
   # ext4_break_layouts, never attempting to wait on page Y which still has an
   # elevated refcount of 2.

  truncate_pagecache_range()
   ... a few layers ...
   dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()

This second race can be fixed with the patch at the end of this function,
which I think should go in, unless there is a benfit to the current retry
scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
With this patch applied I've been able to run my unit test through
thousands of iterations, where it used to failed consistently within 10 or
so.

Even so, I wonder if the real solution is to add synchronization between
the direct I/O path and {ext4,xfs}_break_layouts()?  Other ideas on how
this should be handled?

--- >8 ---

>From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
From: Ross Zwisler <zwisler@xxxxxxxxxx>
Date: Wed, 25 Jul 2018 16:16:05 -0600
Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()

If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Note that this works around the race exposed by my unit test, but I think
that there is another race that needs to be addressed, probably with
additional synchronization added between direct I/O and
{ext4,xfs}_break_layouts().

Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
---
 fs/ext4/inode.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 27e9643bc13b..fedb88104bbf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 	return 0;
 }
 
-static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+static void ext4_wait_dax_page(struct ext4_inode_info *ei)
 {
-	*did_unlock = true;
 	up_write(&ei->i_mmap_sem);
 	schedule();
 	down_write(&ei->i_mmap_sem);
@@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct page *page;
-	bool retry;
 	int error;
 
 	if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
 		return -EINVAL;
 
 	do {
-		retry = false;
 		page = dax_layout_busy_page(inode->i_mapping);
 		if (!page)
 			return 0;
@@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode)
 		error = ___wait_var_event(&page->_refcount,
 				atomic_read(&page->_refcount) == 1,
 				TASK_INTERRUPTIBLE, 0, 0,
-				ext4_wait_dax_page(ei, &retry));
-	} while (error == 0 && retry);
+				ext4_wait_dax_page(ei));
+	} while (error == 0);
 
 	return error;
 }
-- 
2.14.4