Re: [PATCH v2 06/17] mm: doc comment for scary spot in write_one_page

Jeff Layton <jlayton@xxxxxxxxxx> · Wed, 12 Apr 2017 18:55:48 -0400

On Thu, 2017-04-13 at 07:36 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
> 
> > On Wed, 2017-04-12 at 07:38 -0700, Matthew Wilcox wrote:
> > > On Wed, Apr 12, 2017 at 09:01:34AM -0400, Jeff Layton wrote:
> > > > On Wed, 2017-04-12 at 08:06 -0400, Jeff Layton wrote:
> > > > > Not sure what to do here just yet.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > ---
> > > > >  mm/page-writeback.c | 6 ++++++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > > > index de0dbf12e2c1..3ac8399dc984 100644
> > > > > --- a/mm/page-writeback.c
> > > > > +++ b/mm/page-writeback.c
> > > > > @@ -2388,6 +2388,12 @@ int write_one_page(struct page *page)
> > > > >  		ret = mapping->a_ops->writepage(page, &wbc);
> > > > >  		if (ret == 0) {
> > > > >  			wait_on_page_writeback(page);
> > > > > +			/*
> > > > > +			 * FIXME: is this racy? What guarantees that PG_error
> > > > > +			 * will still be set once we get around to checking it?
> > > > > +			 * What if writeback fails, but then a read is issued
> > > > > +			 * before we check this, and that calls ClearPageError?
> > > > > +			 */
> > > > >  			if (PageError(page))
> > > > >  				ret = -EIO;
> > > > >  		}
> > > > 
> > > > Ahh, we are always under the page lock here, and this is generally used
> > > > for writing out directory pages anyway. I'm fine with dropping this
> > > > patch unless someone else sees a problem here.
> > > 
> > > ->writepage drops the page lock.  We're still holding a refcount on this
> > > page, but that's not going to prevent read being called.  But maybe the
> > > filesystem won't call read on a page that's marked as PageError?
> > 
> > Hard to be sure there. I really wonder if that check is needed at all,
> > the more I look at it. After all, we are calling writepage with
> > WB_SYNC_ALL so we should get an error there.
> 
> WB_SYNC_ALL doesn't cause writepage to wait.  It might case it to ask
> for REQ_SYNC, so the write requests gets priority in the block layer.
> WB_SYNC_ALL does cause writepages (with an 's') to wait.
> (At least, that is how I read the code).
> 

Yeah, I realized that after I sent it. If it waited, we wouldn't need
to wait_on_page_writeback there.

> > 
> > Is it also possible these pages could be written back before that point
> > (due to memory pressure or something) and that fail?
> 
> Probably, in which case clear_page_dirty_for_io() will fail and
> write_one_page() will just unlock the page.
> 

Right. So we're already potentially missing errors here for writeback
if it happened to occur before we called clear_page_dirty_for_io.

> > 
> > Maybe we should just have a call to filemap_check_errors on exiting
> > this function?
> 
> I'm leaning in that direction.
> 
> > With the the wb_err_t based stuff, we could change it to sample the
> > wb_err early, and then use that to see if an error has occurred since
> > then. Maybe we should even allow callers to pass a wb_err_t in here, so
> > we can report errors that have occurred since a known point?
> 
> That feels to me like over-engineering.  We would need to
> unconditionally call writepage() for that to work.
> 
> We seem to be agreed that write errors for buffered writes are reported
> per-address-space.  To get per-page errors you have to use direct IO.
> Let's focus on that policy and make it work.
>

That makes sense. I don't think we'd need to always call writepage
there though. We just need to call filemap_check_wb_error even in the
case where writepage wasn't called. IOW, we want to know if there was a
writeback error on the inode, even if it occurred before we tried to
clear the dirty flag.

So, with the new API, we just have to ensure that we sample the
wb_err_t at an earlier time, and then pass that value into
write_one_page so that it can use it to detect the errors since then.

For something like JFS (which apparently uses this function for its fs
journal), the "since" value could be the value at the time that the
journal was last flushed. Then you know whether something failed
regardless of whether it was synced out to disk before this function
was called.

Most of the other callers use this for synchronous directory changes,
so those might want to keep a wb_err_t value in their inode structures,
depending on what they're looking for there. Again, I think this is a
place where we need some input from the fs maintainers.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>