Re: Read/Write NFS I/O performance degraded by FLUSH_STABLE page flushing

Trond Myklebust <trond.myklebust@xxxxxxxxxx> · Sat, 30 May 2009 08:26:03 -0400

On Sat, 2009-05-30 at 10:22 +1000, Greg Banks wrote:
> On Sat, May 30, 2009 at 3:35 AM, Trond Myklebust
> <trond.myklebust@xxxxxxxxxx> wrote:
> > On Fri, 2009-05-29 at 13:25 -0400, Brian R Cowan wrote:
> >>
> >
> > What are you smoking? There is _NO_DIFFERENCE_ between what the server
> > is supposed to do when sent a single stable write, and what it is
> > supposed to do when sent an unstable write plus a commit. BOTH cases are
> > supposed to result in the server writing the data to stable storage
> > before the stable write / commit is allowed to return a reply.
> 
> This probably makes no difference to the discussion, but for a Linux
> server there is a subtle difference between what the server is
> supposed to do and what it actually does.
> 
> For a stable WRITE rpc, the Linux server sets O_SYNC in the struct
> file during the vfs_writev() call and expects the underlying
> filesystem to obey that flag and flush the data to disk.  For a COMMIT
> rpc, the Linux server uses the underlying filesystem's f_op->fsync
> instead.  This results in some potential differences:
> 
>  * The underlying filesystem might be broken in one code path and not
> the other (e.g. ignoring O_SYNC in f_op->{aio_,}write or silently
> failing in f_op->fsync).  These kinds of bugs tend to be subtle
> because in the absence of a crash they affect only the timing of IO
> and so they might not be noticed.
> 
>  * The underlying filesystem might be doing more or better things in
> one code path or the other, e.g. optimising allocations.
> 
>  * The Linux NFS server ignores the byte range in the COMMIT rpc and
> flushes the whole file (I suspect this is a historical accident rather
> than deliberate policy).  If there is other dirty data on that file
> server-side, that other data will be written too before the COMMIT
> reply is sent.  This may have a performance impact, depending on the
> workload.
> 
> > The extra RPC round trip (+ parsing overhead ++++) due to the commit
> > call is the _only_ difference.
> 
> This is almost completely true.  If the server behaved ideally and
> predictably, this would be completely true.
> 
> </pedant>
> 

Firstly, the server only uses O_SYNC if you turn off write gathering
(i.e. export with 'no_wdelay' rather than the default 'wdelay'). The
default behaviour for the Linux NFS server is to always try write
gathering and hence no O_SYNC.

Secondly, even if that were the case, it would not justify changing
the client behaviour. The NFS protocol does not mandate, or even
recommend, that the server use O_SYNC. All it says is that a stable write
and an unstable write+commit should both have the same result: namely
that the data+metadata must have been flushed to stable storage. The
protocol spec leaves it as an exercise to the server implementer to do
this as efficiently as possible.
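
As a rough userspace analogy only (this is not the nfsd code, and the
function names write_stable() and write_then_commit() are made up for
illustration), the two paths look something like the sketch below. One
opens the file with O_SYNC, the other does a plain write followed by
fsync(). Either way the data should be on stable storage by the time the
function returns successfully. For brevity the sketch treats a short
write as a failure.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Analogue of a stable WRITE: with O_SYNC, write() should not return
 * until the data has reached stable storage. */
static int write_stable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;   /* short write counts as failure */

    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Analogue of an unstable WRITE followed by a COMMIT: the write() may
 * only dirty the page cache; fsync() then flushes it before we return. */
static int write_then_commit(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;   /* short write counts as failure */

    if (rc == 0 && fsync(fd) != 0)           /* the "commit" step */
        rc = -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Both functions return 0 on success and -1 on failure, so a caller can
use either one interchangeably. That is the point: the end result is
supposed to be the same, and only the number of calls needed to get
there differs.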

  Trond
