Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes

Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> · Thu, 07 Jan 2010 20:37:31 -0500

On Fri, 2010-01-08 at 09:17 +0800, Wu Fengguang wrote: 
> On Thu, Jan 07, 2010 at 11:10:22PM +0800, Trond Myklebust wrote:
> > On Thu, 2010-01-07 at 22:56 +0800, Wu Fengguang wrote: 
> > > On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote:
> > > 
> > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > > index d171696..910be28 100644
> > > > > > --- a/fs/nfs/write.c
> > > > > > +++ b/fs/nfs/write.c
> > > > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> > > > > >     spin_unlock(&inode->i_lock);
> > > > > >     inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> > > > > >     inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > > > > -   __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > > > +   mark_inode_unstable_pages(inode);
> > > > >
> > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size.
> > > > 
> > > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after
> > > > this patch is applied. We won't ever need it.
> > > > 
> > > > If the VM or VFS is doing it, then they ought to be fixed: there is no
> > > > reason to assume that all filesystems need to sync their inodes on
> > > > i_size changes.
> > > 
> > > Sorry, one more question.
> > > 
> > > It seems to me that you are replacing
> > > 
> > >         I_DIRTY_DATASYNC => write_inode()
> > > with
> > >         I_UNSTABLE_PAGES => commit_unstable_pages()
> > > 
> > > Is that change for the sake of clarity? Or to fix some problem?
> > > (This patch does fix some problems, but do they inherently require
> > > the above change?)
> > 
> > As I said previously, the write_inode() call is done _before_ you sync
> > the dirty pages to the server, whereas commit_unstable_pages() wants to
> > be done _after_ syncing. So the two are not the same, and we cannot
> > replace commit_unstable_pages() with write_inode().
> 
> This is the ordering:
> 
> 0       do_writepages()
> 1       if (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> 2               write_inode()
> 3       if (wait)
> 4               filemap_fdatawait()
> 5       if (I_UNSTABLE_PAGES)
> 6               commit_unstable_pages()
> 
> The page is synced to NFS server in line 0.
> 
> The only difference is write_inode() is called before filemap_fdatawait(),
> while commit_unstable_pages() is called after it.
> 
> Note that filemap_fdatawait() will only be called on WB_SYNC_ALL, so I
> still cannot understand the difference..

The difference is precisely that...

In the case of WB_SYNC_ALL we want the call to filemap_fdatawait() to
occur before we call commit_unstable_pages(), so that we know that all
the in-flight write rpc calls are done before we ask that they be
committed to stable storage.

In the case of WB_SYNC_NONE, there is no wait, and so we are forced to
play games with heuristics and/or add the force_commit_unstable flag
because we don't wait for the dirty pages to be cleaned. I don't like
this, but those are the semantics that we've defined for WB_SYNC_NONE.

Cheers
  Trond
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html