On Thu, Jan 06, 2011 at 03:45:10PM -0500, Christoph Hellwig wrote:
> > > The problem is that currently we almost never do a pure blocking
> > > ->write_inode. The way the sync code is written we always do a
> > > non-blocking one first, then a blocking one. If you always do the
> > > synchronous one we'll get a lot more overhead - first the previous
> > > asynchronous one will write the inode (be it just into the log, or
> > > for real), then we write back data, and then we'll have to write it
> > > again because it has usually been redirtied due to the data
> > > writeback in the meantime.
> >
> > It doesn't matter, the integrity still has to be enforced in
> > .sync_fs, because a sync .write_inode may *never* get called: async
> > .write_inode also clears the inode metadata dirty bits.
> >
> > So if .sync_fs has to enforce integrity anyway, then you don't ever
> > need to do any actual waiting in your sync .write_inode. See?
>
> I'm not talking about the actual waiting. Right now we have two
> different use cases for ->write_inode:
>
>  1) sync_mode == WB_SYNC_NONE
>
>     This tells the filesystem to start an opportunistic writeout.
>
>  2) sync_mode == WB_SYNC_ALL
>
>     This tells the filesystem it needs to do a mandatory writeout.
>
> Note that writeout is loosely defined. If a filesystem isn't
> exportable, or implements the commit_metadata operation, it's indeed
> enough to synchronize the state into internal fs data, just enough
> for ->sync_fs.
>
> Or that's how it should be. As you pointed out, the way the writeback
> code treats the WB_SYNC_NONE writeouts makes this not work as
> expected.
>
> There are various ways to fix this:
>
>  1) the one you advocate, that is treating all ->write_inode calls as
>     if they were WB_SYNC_ALL. This does fix the issue of incorrectly
>     updating the dirty state, but causes a lot of additional I/O -
>     the way the sync process is designed we basically always call
>     ->write_inode with WB_SYNC_NONE first, and then with WB_SYNC_ALL.
>  2) keep the WB_SYNC_NONE calls, but never update dirty state for
>     them. This also fixes the i_dirty state updates, but allows
>     filesystems that keep internal dirty state to be smarter about
>     avoiding I/O.
>  3) remove the calls to ->write_inode with WB_SYNC_NONE. This might
>     work well for calls from the sync() callchain, but we rely on
>     inode background writeback from the flusher threads in lots of
>     places. Note that we really do not want to block the flusher
>     threads with blocking writes, which is another argument against
>     (1).
>  4) require ->write_inode to update the dirty state itself after the
>     inode is on disk or in a data structure caught by ->sync_fs.
>     This keeps optimal behaviour, but requires a lot of code changes.
>
> If we want a quick fix, only (2) seems feasible to me, with the option
> of implementing (4) and parts of (3) later on.

No, you misunderstand (1): I am saying all ->write_inode calls should
be treated as WB_SYNC_NONE. In fact (2) would cause much more IO,
because dirty writeout would never clean the inodes, so the flusher
would just keep writing them out. I don't know how (2) could be
feasible.
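Just to make sure we are arguing about the same thing, here is roughly
what the two cases look like from the filesystem side. This is only a
sketch, not a proposal: "foofs" and its helpers are made up for
illustration, and only the ->write_inode prototype and the
wbc->sync_mode check are the real interface.

#include <linux/fs.h>
#include <linux/writeback.h>

/* hypothetical fs-internal state and helpers, not a real API */
struct foofs_inode_info;
struct foofs_inode_info *FOOFS_I(struct inode *inode);
void foofs_queue_inode_dirty(struct foofs_inode_info *fi);
int foofs_flush_inode(struct foofs_inode_info *fi, int wait);

static int foofs_write_inode(struct inode *inode,
			     struct writeback_control *wbc)
{
	struct foofs_inode_info *fi = FOOFS_I(inode);

	if (wbc->sync_mode == WB_SYNC_NONE) {
		/*
		 * Opportunistic writeout: move the VFS dirty state into
		 * the fs-internal dirty tracking and return.  The point
		 * we are arguing about: the caller clears the VFS dirty
		 * bits anyway, whether or not a WB_SYNC_ALL call ever
		 * follows for this inode.
		 */
		foofs_queue_inode_dirty(fi);
		return 0;
	}

	/*
	 * WB_SYNC_ALL: mandatory writeout.  "Writeout" is loosely
	 * defined -- getting the inode into a state that ->sync_fs will
	 * catch is enough.
	 */
	return foofs_flush_inode(fi, 1);
}

Which is exactly why the integrity has to live in .sync_fs: nothing
guarantees the WB_SYNC_ALL branch is ever taken for a given inode.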
> > > We need to propagate the VFS dirty state into the fs-internal
> > > state, e.g. for XFS start a transaction. The reason for that is
> > > that the VFS simply writes timestamps into the inode and marks it
> > > dirty instead of telling the filesystem about timestamp updates.
> > > For XFS in 2.6.38+ timestamp updates and i_size updates are the
> > > only unlogged metadata changes, and thus now the only thing going
> > > through ->write_inode.
> >
> > Well then you have a bug, because a sync .write_inode *may never get
> > called*. You may only get an async one, even in the case of fsync,
> > because async writeback clears the vfs dirty bits.
>
> Yes, the bug about updating the dirty state for WB_SYNC_NONE affects

So, back to my original question: what is the performance problem with
treating .write_inode as WB_SYNC_NONE, and then having .fsync and
.sync_fs do the integrity?
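To illustrate, a minimal sketch of the model I have in mind -- again
the foofs_* names are invented, and only the ->sync_fs and ->fsync
prototypes are meant to match the current super_operations and
file_operations:

#include <linux/fs.h>

/* hypothetical fs-internal helpers, not a real API */
int foofs_start_flush_dirty_inodes(struct super_block *sb);
int foofs_wait_for_commit(struct super_block *sb);
int foofs_commit_inode(struct inode *inode, int datasync);

/*
 * ->write_inode never blocks; it only moves VFS dirty state into the
 * fs-internal dirty tracking (see the earlier sketch).  The integrity
 * guarantee lives here instead.
 */

static int foofs_sync_fs(struct super_block *sb, int wait)
{
	int err, err2;

	/* push out everything that ->write_inode only queued internally */
	err = foofs_start_flush_dirty_inodes(sb);

	if (wait) {
		/* the blocking part happens here, once, not per inode */
		err2 = foofs_wait_for_commit(sb);
		if (!err)
			err = err2;
	}
	return err;
}

static int foofs_fsync(struct file *file, int datasync)
{
	struct inode *inode = file->f_mapping->host;

	/* commit this one inode's metadata and wait for it */
	return foofs_commit_inode(inode, datasync);
}

That is all I mean by "treating .write_inode as WB_SYNC_NONE": the
waiting happens once in .sync_fs / .fsync, not per ->write_inode call.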