Re: [nfsv4] layoutcommits and file layout

Trond Myklebust <Trond.Myklebust@xxxxxxxxxx> · Wed, 05 Jan 2011 14:14:34 -0500

On Wed, 2011-01-05 at 14:04 -0500, Trond Myklebust wrote: 
> On Wed, 2011-01-05 at 21:01 +0200, Benny Halevy wrote: 
> > On 2011-01-03 16:40, Trond Myklebust wrote:
> > > On Mon, 2011-01-03 at 16:21 +0200, Benny Halevy wrote: 
> > >> On 2010-12-17 01:07, Christoph Hellwig wrote:
> > >>> On Thu, Dec 16, 2010 at 11:21:21AM -0500, Matt W. Benjamin wrote:
> > >>>> Hi,
> > >>>>
> > >>>> We have a files implementation which wants to receive LAYOUTCOMMIT when a client is finished with a layout.  It was my clear understanding from rfc5661 that we could expect this behavior.
> > >>>
> > >>> Care to post it to the list?
> > >>>
> > >>
> > >> I don't know what Matt's server is doing but the fundamental problem is
> > >> manifested with extending a file with parallel DS writes.
> > >> Assuming that the DS writes are executed in arbitrary order,
> > >> exposing the file length before LAYOUTCOMMIT can cause
> > >> a concurrent reader to read a hole.  Although locking can
> > >> solve this case, day-to-day applications that work well over
> > >> local filesystem and legacy NFS may break because of this.
> > > 
> > > ...and this differs from ordinary NFS writes exactly how?
> > > 
> > > Both cached and uncached (i.e. O_DIRECT) writes can and will be flushed
> > > to disk in entirely random order when writing to the MDS. If you have a
> > > parallel reader on another client (or even on the same client in the
> > > case of O_DIRECT), and want it to see accurate data, then use locking.
> > > If not, you will see holes and other strangeness.
> > > 
> > > IOW: There are no 'day-to-day applications that work well over legacy
> > > NFS' that rely on this behaviour.
> > > 
> > 
> > Assuming the client writes sequentially (over tcp) the writes will
> > practically be processed in order into the server's cache so with
> > no crashes in the mix a parallel reader will see no holes.
> > I'd really like the following scenario to work over pNFS with
> > no hassles:
> > 	"some app >> foo" on one client, and
> > 	"tail -f foo" on another
> 
> No, that doesn't work today! Believe me, I get the "bug reports"...
> 
> There is no point in trying to add properties to pNFS that don't exist
> with ordinary NFS.

...and for the record: use of TCP does _not_ suffice to ensure writes
are processed in order.

In the Linux kernel, we have all sorts of parallelism going on before
the writes even hit the socket on the client. Everything from background
flushing to queuing in the sunrpc layer (e.g. for a session slot)
conspires to destroy any hope of ever achieving what you propose above.
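
To make the reordering point concrete, here is a minimal sketch, not
anything taken from the kernel. It assumes a local temporary file can
stand in for the server's copy of the file; the chunk size and the name
holes_seen are hypothetical, chosen only for illustration. It applies
fixed-size writes in a caller-chosen order and, after each one, records
which chunks below the current end of file still read back as zeros,
which is what a concurrent reader would see.

```python
import os
import tempfile

CHUNK = 4096  # hypothetical write size, for illustration only


def holes_seen(order, chunk=CHUNK):
    """Apply chunk-sized writes in the given order.

    Returns one (file_size, zero_chunks) tuple per write, where
    zero_chunks lists the chunk indexes below EOF that still read
    back as all zeros at that moment.
    """
    fd, path = tempfile.mkstemp()
    snapshots = []
    try:
        for idx in order:
            # Positional write: lands at idx * chunk regardless of what
            # has been written before it.
            os.pwrite(fd, b"x" * chunk, idx * chunk)
            size = os.fstat(fd).st_size
            # Anything never written but below EOF reads as zeros.
            zero_chunks = [
                i for i in range(size // chunk)
                if os.pread(fd, chunk, i * chunk) == b"\0" * chunk
            ]
            snapshots.append((size, zero_chunks))
    finally:
        os.close(fd)
        os.unlink(path)
    return snapshots
```

With order [0, 1, 2, 3] every snapshot has an empty hole list. With
order [3, 0, 2, 1] the first snapshot already reports the full file
size while chunks 0, 1 and 2 are still zeros.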

That's not even counting what goes on at the server side. Think, for
instance, of the case where the server crashes before a COMMIT has been
successfully sent. Not only will your reader see holes, it will think
the file has been truncated...
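
The locking advice quoted earlier in this thread could look roughly
like the following. This is a sketch under stated assumptions: the
helper names append_locked and read_from are hypothetical, it uses
POSIX byte-range locks through Python's fcntl.lockf, and it relies on
the NFS client flushing dirty data on unlock and revalidating its cache
when a lock is acquired. It does nothing about a server crash before
COMMIT; it only keeps a cooperating reader from looking at the file
while the writer is in the middle of an append.

```python
import fcntl
import os


def append_locked(path, data):
    """Append data while holding an exclusive lock on the whole file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            os.write(fd, data)
            os.fsync(fd)  # push the data out before the lock is dropped
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def read_from(path, offset):
    """Return everything from offset to EOF, read under a shared lock."""
    fd = os.open(path, os.O_RDONLY)
    try:
        fcntl.lockf(fd, fcntl.LOCK_SH)
        try:
            size = os.fstat(fd).st_size
            if size <= offset:
                return b""
            return os.pread(fd, size - offset, offset)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A "tail -f" style reader would call read_from in a loop, advancing
offset by the length of whatever it got back. Both sides have to take
the lock; an unmodified tail does not.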

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
