On Sun, 2018-09-16 at 20:18 -0400, Chris Siebenmann wrote:
> > > > Since failing to close() before another machine open()s puts you
> > > > outside this outline of close-to-open, this kernel behavior is
> > > > not a bug as such (or so it's been explained to me here). If you
> > > > go outside c-t-o, the kernel is free to do whatever it finds most
> > > > convenient, and what it found most convenient was to not bother
> > > > invalidating some cached page data even though it saw a GETATTR
> > > > change.
> > > 
> > > That would be a bug. If we have reason to believe the file has
> > > changed, then we must invalidate the cache on the file prior to
> > > allowing a read to proceed.
> > 
> > The point here is that when the file is open for writing (or for
> > read+write), and your applications are not using locking, then we
> > have no reason to believe the file is being changed on the server,
> > and we deliberately optimise for the case where the cache
> > consistency rules are being observed.
> 
> In this case the user level can be completely sure that the client
> kernel has issued a GETATTR and received a different answer from the
> NFS server, because the fstat() results it sees have changed from the
> values it has seen before (and remembered). This may not count as the
> NFS client kernel code '[having] reason to believe' that the file has
> changed on the server from its perspective, but if so it's not because
> the information is not available and a GETATTR would have to be
> explicitly issued to find it out. The client code has made the GETATTR
> and received different results, which it has passed to user level; it
> has just not used those results to do things to its cached data.
> 
> Today, if you do a flock(), the NFS client code in the kernel will do
> things that invalidate the cached data, despite the GETATTR result
> from the fileserver not changing. From my outside perspective, as
> someone writing code or dealing with programs that must work over NFS,
> this is a little bit magical, and as a result I would like to
> understand if it is guaranteed that the magic works or if this is not
> officially supported magic, merely 'it happens to work' magic in the
> way that having the file open read-write without the flock() used to
> work in kernel 4.4.x but doesn't now (and this is simply considered to
> be the kernel using CTO more strongly, not a bug).
> 
> (Looking at a tcpdump trace, the flock() call appears to cause the
> kernel to issue another GETATTR to the fileserver. The results are the
> same as the GETATTR results that were passed to the client program.)

This is also documented in the NFS FAQ to which I pointed you earlier.

> > Again, these are the cases where you are _not_ using locking to
> > mediate. If you are using locking, then I agree that changes need to
> > be seen by the client.
> 
> The original code (Alpine) *is* using locking in the broad sense,
> but it is not flock() locking; instead it is locking (in this case)
> through .lock files. The current kernel behavior and what I've been
> told about it implies that it is not sufficient for your application
> to perfectly coordinate locking, writes, fsync(), and fstat()
> visibility of the resulting changes through its own mechanism; you
> must do your locking through the officially approved kernel channels
> (and it is not clear what they are) or see potentially incorrect
> results.
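
For reference, a minimal sketch of the flock()-before-read pattern being
discussed above, on the assumption that taking the lock is what triggers
the client-side revalidation; the path, buffer size and error handling
are illustrative only and not taken from the original mails:

/* Sketch: take a kernel-visible lock so the NFS client revalidates
 * its cached data before we read.  Path and sizes are examples. */
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd = open("/nfs/mail/inbox", O_RDWR);	/* example path */

	if (fd < 0)
		return 1;

	/* flock() is the kernel-visible locking channel; as observed in
	 * the tcpdump trace above, taking it also issues a GETATTR and
	 * causes cached data to be invalidated. */
	if (flock(fd, LOCK_EX) < 0) {
		close(fd);
		return 1;
	}

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		/* process newly visible data here */;

	flock(fd, LOCK_UN);
	close(fd);
	return 0;
}
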
> Consider a system where reads and writes to a shared file are
> coordinated by a central process that everyone communicates with
> through TCP connections. The central process pauses readers before it
> allows a writer to start, the writer always fsync()s before it
> releases its write permissions, and then no reader is permitted to
> proceed until the entire cluster sees the same updated fstat() result.
> This is perfectly coordinated but currently could see incorrect read()
> results, and I've been told that this is allowed under Linux's CTO
> rules because all of the processes hold the file open read-write
> through this entire process (and no one flock()s).

Why would such a system need to use buffered I/O instead of uncached
I/O (i.e. O_DIRECT)? What would be the point of optimising the buffered
I/O client for this use case rather than the close-to-open cache
consistent case?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
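
For reference, a minimal sketch of the uncached-I/O alternative raised
above, assuming O_DIRECT reads of a hypothetical shared file; the path
and the 4096-byte alignment/transfer size are assumptions, and real
code must respect O_DIRECT's alignment requirements:

/* Sketch: open with O_DIRECT so reads bypass the client page cache
 * rather than relying on buffered-I/O cache consistency. */
#define _GNU_SOURCE		/* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	ssize_t n;
	int fd = open("/nfs/shared/datafile", O_RDONLY | O_DIRECT);

	if (fd < 0)
		return 1;

	/* O_DIRECT buffers (and offsets/lengths) must be suitably
	 * aligned; 4096 is used here as an illustrative choice. */
	if (posix_memalign(&buf, 4096, 4096) != 0) {
		close(fd);
		return 1;
	}

	/* Each read is serviced without the client page cache, so it
	 * reflects the server's current data rather than cached pages. */
	while ((n = read(fd, buf, 4096)) > 0)
		/* process data here */;

	free(buf);
	close(fd);
	return 0;
}
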