+ fsdevel and the xfs list.

On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> wrote:
> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> > On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > > Changes since v3:
> > >  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > >    that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > >    (Dave, Darrick and Jan)
> >
> > How about the occasional WARN_ON_ONCE you mention below. Were you able to
> > hunt them down?
>
> The root cause of this issue is that while the ei->i_mmap_sem provides
> synchronization between ext4_break_layouts() and page faults, it doesn't
> synchronize us with the direct I/O path. This exact same issue exists
> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
>
> This allows the direct I/O path to do I/O and raise & lower page->_refcount
> while we're executing a truncate/hole punch. This leads to us trying to free
> a page with an elevated refcount.
>
> Here's one instance of the race:
>
> CPU 0                                   CPU 1
> -----                                   -----
> ext4_punch_hole()
>   ext4_break_layouts() # all pages have refcount=1
>
>                                         ext4_direct_IO()
>                                           ... lots of layers ...
>                                           follow_page_pte()
>                                             get_page() # elevates refcount
>
>   truncate_pagecache_range()
>     ... a few layers ...
>     dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
>
> A similar race occurs when the refcount is being dropped while we're running
> ext4_break_layouts(), and this is the one that my test was actually hitting:
>
> CPU 0                                   CPU 1
> -----                                   -----
> ext4_direct_IO()
>   ... lots of layers ...
>   follow_page_pte()
>     get_page()
>     # elevates refcount of page X
>
>                                         ext4_punch_hole()
>                                           ext4_break_layouts() # two pages, X & Y, have refcount == 2
>                                             __wait_var_event() # called for page X
>
> __put_devmap_managed_page()
> # drops refcount of X to 1
>
>                                         # __wait_var_event() checks X's refcount in "if (condition)", and breaks.
>                                         # We never actually called ext4_wait_dax_page(), so 'retry' in
>                                         # ext4_break_layouts() is still false. Exit do/while loop in
>                                         # ext4_break_layouts(), never attempting to wait on page Y which still has an
>                                         # elevated refcount of 2.
>
>                                           truncate_pagecache_range()
>                                             ... a few layers ...
>                                             dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
>
> This second race can be fixed with the patch at the end of this mail, which
> I think should go in, unless there is a benefit to the current retry scheme
> which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
> With this patch applied I've been able to run my unit test through
> thousands of iterations, where it used to fail consistently within 10 or
> so.
>
> Even so, I wonder if the real solution is to add synchronization between
> the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how
> this should be handled?
>
> --- >8 ---
>
> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
> From: Ross Zwisler <zwisler@xxxxxxxxxx>
> Date: Wed, 25 Jul 2018 16:16:05 -0600
> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
>
> If the refcount of a page is lowered between the time that it is returned
> by dax_layout_busy_page() and when the refcount is again checked in
> ext4_break_layouts() => ___wait_var_event(), the waiting function
> ext4_wait_dax_page() will never be called. This means that
> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
> looping and never check the refcount of other pages in this inode.
>
> Instead, always continue looping as long as dax_layout_busy_page() gives us
> a page which it found with an elevated refcount.
>
> Note that this works around the race exposed by my unit test, but I think
> that there is another race that needs to be addressed, probably with
> additional synchronization added between direct I/O and
> {ext4,xfs}_break_layouts().
>
> Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> ---
>  fs/ext4/inode.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 27e9643bc13b..fedb88104bbf 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
>  	return 0;
>  }
>  
> -static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
> +static void ext4_wait_dax_page(struct ext4_inode_info *ei)
>  {
> -	*did_unlock = true;
>  	up_write(&ei->i_mmap_sem);
>  	schedule();
>  	down_write(&ei->i_mmap_sem);
> @@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
>  {
>  	struct ext4_inode_info *ei = EXT4_I(inode);
>  	struct page *page;
> -	bool retry;
>  	int error;
>  
>  	if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
>  		return -EINVAL;
>  
>  	do {
> -		retry = false;
>  		page = dax_layout_busy_page(inode->i_mapping);
>  		if (!page)
>  			return 0;
>  
>  		error = ___wait_var_event(&page->_refcount,
>  				atomic_read(&page->_refcount) == 1,
>  				TASK_INTERRUPTIBLE, 0, 0,
> -				ext4_wait_dax_page(ei, &retry));
> -	} while (error == 0 && retry);
> +				ext4_wait_dax_page(ei));
> +	} while (error == 0);
>  
>  	return error;
>  }
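
A note on the mechanism for anyone following along: ___wait_var_event()
evaluates its condition before it ever runs the sleep helper passed as its
last argument, so if page X's refcount has already dropped back to 1 by the
time we get there, ext4_wait_dax_page() (and with it the old 'retry = true')
simply never runs. Roughly, the shape is as below; this is a simplified
illustration of the check-then-sleep ordering, not the actual
include/linux/wait_bit.h macro:

	/*
	 * Simplified sketch only -- the real ___wait_var_event() is a macro
	 * with waitqueue setup and signal handling.  The relevant part is
	 * the ordering: the condition is tested first, and the "cmd"
	 * argument (ext4_wait_dax_page(ei) here) only runs while the
	 * condition is still false.
	 */
	for (;;) {
		if (atomic_read(&page->_refcount) == 1)
			break;			/* already true: cmd never runs */
		ext4_wait_dax_page(ei);		/* drop i_mmap_sem, schedule(), retake it */
	}

So with the old code, a refcount that drops in that window leaves 'retry'
false and we bail out of the do/while loop without ever looking at page Y.
Looping for as long as dax_layout_busy_page() keeps handing back a busy
page, as the patch does, avoids depending on whether we actually slept.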