Re: [PATCH 1/6] VFS: Ensure that writeback_single_inode() commits unstable writes

Wu Fengguang <fengguang.wu@xxxxxxxxx> · Fri, 8 Jan 2010 09:53:19 +0800

On Fri, Jan 08, 2010 at 09:37:31AM +0800, Trond Myklebust wrote:
> On Fri, 2010-01-08 at 09:17 +0800, Wu Fengguang wrote: 
> > On Thu, Jan 07, 2010 at 11:10:22PM +0800, Trond Myklebust wrote:
> > > On Thu, 2010-01-07 at 22:56 +0800, Wu Fengguang wrote: 
> > > > On Thu, Jan 07, 2010 at 12:38:02PM +0800, Myklebust, Trond wrote:
> > > > 
> > > > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > > > index d171696..910be28 100644
> > > > > > > --- a/fs/nfs/write.c
> > > > > > > +++ b/fs/nfs/write.c
> > > > > > > @@ -441,7 +441,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> > > > > > >     spin_unlock(&inode->i_lock);
> > > > > > >     inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> > > > > > >     inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > > > > > -   __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > > > > > > +   mark_inode_unstable_pages(inode);
> > > > > >
> > > > > > Then we shall mark I_DIRTY_DATASYNC on other places that extend i_size.
> > > > > 
> > > > > Why? The NFS client itself shouldn't ever set I_DIRTY_DATASYNC after
> > > > > this patch is applied. We won't ever need it.
> > > > > 
> > > > > If the VM or VFS is doing it, then they ought to be fixed: there is no
> > > > > reason to assume that all filesystems need to sync their inodes on
> > > > > i_size changes.
> > > > 
> > > > Sorry, one more question.
> > > > 
> > > > It seems to me that you are replacing
> > > > 
> > > >         I_DIRTY_DATASYNC => write_inode()
> > > > with
> > > >         I_UNSTABLE_PAGES => commit_unstable_pages()
> > > > 
> > > > Is that change for the sake of clarity? Or to fix some problem?
> > > > (This patch does fix some problems, but do they inherently require
> > > > the above change?)
> > > 
> > > As I said previously, the write_inode() call is done _before_ you sync
> > > the dirty pages to the server, whereas commit_unstable_pages() wants to
> > > be done _after_ syncing. So the two are not the same, and we cannot
> > > replace commit_unstable_pages() with write_inode().
> > 
> > This is the ordering:
> > 
> > 0       do_writepages()
> > 1       if (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
> > 2               write_inode()
> > 3       if (wait)
> > 4               filemap_fdatawait()
> > 5       if (I_UNSTABLE_PAGES)
> > 6               commit_unstable_pages()
> > 
> > The page is synced to NFS server in line 0.
> > 
> > The only difference is write_inode() is called before filemap_fdatawait(),
> > while commit_unstable_pages() is called after it.
> > 
> > Note that filemap_fdatawait() will only be called on WB_SYNC_ALL, so I
> > still cannot understand the difference..
> 
> The difference is precisely that...

Thanks, I got it.

> In the case of WB_SYNC_ALL we want the call to filemap_fdatawait() to
> occur before we call commit_unstable_pages(), so that we know that all
> the in-flight write rpc calls are done before we ask that they be
> committed to stable storage.

That's good order for WB_SYNC_ALL. However this is optimizing a minor
case, and what I cared is WB_SYNC_NONE :)

> In the case of WB_SYNC_NONE, there is no wait, and so we are forced to
> play games with heuristics and/or add the force_commit_unstable flag
> because we don't wait for the dirty pages to be cleaned. I don't like
> this, but those are the semantics that we've defined for WB_SYNC_NONE.

For WB_SYNC_NONE we will now also wait for WRITE completion with the
combination of NFS_PAGE_TAG_LOCKED-based-bail-out and redirty_tail().
This is retry based, so less elegant.

But that's not the whole story. The
I_UNSTABLE_PAGES+commit_unstable_pages() scheme seems elegant for
WB_SYNC_ALL, however it may break the pipeline for big files in a
perfect loop:
        loop {
                WRITE 4MB
                COMMIT 4MB
        }
While the retry based WB_SYNC_ALL will keep back-off COMMITs because
do_writepages() keep submit new WRITEs. So its loop would be
        loop {
                WRITE 4MB
                <skip COMMIT>
                WRITE 4MB
                <skip COMMIT>
                WRITE 4MB
                <skip COMMIT>
                WRITE 4MB
                <skip COMMIT>
                ...
                <redirty_tail timeout>
                COMMIT 400MB
        }

That can be improved by lifting the writeback chunk size from 4MB to >=128MB.

Thanks,
Fengguang
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html