RE: client caching and locks

"NeilBrown" <neilb@xxxxxxx> · Tue, 28 Dec 2021 16:11:51 +1100

On Tue, 28 Dec 2021, inoguchi.yuki@xxxxxxxxxxx wrote:
> Hi,
> 
> Sorry to resurrect this old thread, but I wonder how NFS clients should behave.
> 
> I'm seeing this behavior when I run a test program using Open MPI. In the test program, 
> two clients acquire locks at each write location. Then they simultaneously write 
> data to the same file in NFS. 
> 
> In other words, the program does just like Bruce explained previously:
> 
> > > > > >         client 0                        client 1
> > > > > >         --------                        --------
> > > > > >         take write lock on byte 0
> > > > > >                                         take write lock on byte 1
> > > > > >         write 1 to offset 0
> > > > > >           change attribute now x+1
> > > > > >                                         write 1 to offset 1
> > > > > >                                           change attribute now x+2
> > > > > >         getattr returns x+2
> > > > > >                                         getattr returns x+2
> > > > > >         unlock
> > > > > >                                         unlock
> > > > > >
> > > > > >         take readlock on byte 1
> 
> In my test, 
> - The file data is zero-filled before the write.
> - client 0 acquires a write lock at offset 0 and writes 1 to it.
> - client 1 acquires a write lock at offset 4 and writes 2 to it.
> 
> After the test, sometimes I'm seeing the following result. A client doesn't reflect the other's update.
> -----
> - client 0:
> [user@client0 nfs]$ od -i data
> 0000000           1           2
> 0000010
> 
> - client 1:
> [user@client1 nfs]# od -i data
> 0000000           0           2
> 0000010
> 
> - NFS server:
> [user@server nfs]# od -i data
> 0000000           1           2
> 0000010
> -----
> 
> This happens because client 1 receives GETATTR reply after both clients' writes completed.
> Therefore, client 1 assumes the file data is unchanged since its last write.

This is due to an (arguable) weakness in the NFSv4 protocol.
In NFSv3 the reply to the WRITE request had "wcc" data which would
report change information before and after the write and, if present, it
was required to be atomic.  So, provided the timestamps had a high enough
resolution, client 0 would see change information from *before* the
write from client 1 completed, and so it would know it needed to refresh
after that write became visible.

With NFSv4 there are no atomicity guarantees relating writes to the
change attribute (changeid).
There is provision for atomicity around directory operations, but not
around data operations.
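
The difference can be illustrated with a toy model.  This is only a sketch
under simplifying assumptions, not how any real client or server is
implemented: the Server and Client classes, their method names, and the
integer change counter are all made up for illustration.  It replays the
two-client scenario quoted above, once with an atomic before/after pair in
the write reply (wcc-like) and once with a separate GETATTR after the write.

```python
class Server:
    """Toy file server: a small array of values plus a change counter."""
    def __init__(self, size):
        self.data = [0] * size
        self.change = 0

    def write(self, offset, value):
        before = self.change
        self.data[offset] = value
        self.change += 1
        return before, self.change   # atomic before/after pair (wcc-like)

    def getattr(self):
        return self.change

    def read(self):
        return list(self.data)


class Client:
    """Toy client that caches the whole file and the change value it saw."""
    def __init__(self, server):
        self.server = server
        self.cache = server.read()
        self.cached_change = server.getattr()

    def write_with_wcc(self, offset, value):
        before, after = self.server.write(offset, value)
        if before == self.cached_change:
            self.cache[offset] = value   # nobody else wrote in between
        else:
            self.cache = None            # someone else changed the file
        self.cached_change = after

    def write_then_getattr(self, offset, value):
        self.server.write(offset, value)
        self.cache[offset] = value
        # Without an atomic pair, the client cannot tell whether the new
        # change value covers only its own write or others' writes as well.
        self.cached_change = self.server.getattr()

    def read(self):
        if self.cache is None or self.cached_change != self.server.getattr():
            self.cache = self.server.read()
            self.cached_change = self.server.getattr()
        return list(self.cache)


def run(use_wcc):
    server = Server(2)
    c0, c1 = Client(server), Client(server)   # both cache the zero-filled file
    if use_wcc:
        c0.write_with_wcc(0, 1)
        c1.write_with_wcc(1, 2)
    else:
        c0.write_then_getattr(0, 1)
        c1.write_then_getattr(1, 2)
    return c0.read(), c1.read()


if __name__ == "__main__":
    print("separate GETATTR:", run(use_wcc=False))
    print("atomic wcc pair: ", run(use_wcc=True))
```

In this model the separate-GETATTR run leaves client 1 with a stale first
value, matching the 'od -i' output in the report, while the wcc-like run
makes client 1 notice the intervening write and refetch.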

This means that if different clients access a file concurrently, then
their caches can become incorrect.  The only way to ensure uncorrupted
data is to use locking for ALL reads and writes.  The above 'od -i' does
not perform a locked read, so it can give incorrect data.
If you took a whole-file lock before reading, then you should get correct
data.
You could argue that this requirement (always lock if there is any risk)
is by design, and so this aspect of the protocol is not a weakness.
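
A locked read could look something like the following.  This is a minimal
sketch, assuming Python on a POSIX system; the helper name locked_read() is
hypothetical.  It takes a shared (read) lock over the whole file with
fcntl.lockf(), which uses fcntl()-style byte-range locks, reads the data
while the lock is held, and then unlocks.

```python
import fcntl


def locked_read(path):
    """Read a whole file while holding a whole-file shared (read) lock."""
    # buffering=0 so that no data is read into a userspace buffer
    # before the lock has been acquired.
    with open(path, "rb", buffering=0) as f:
        # With the default len=0 and start=0, the lock covers the file
        # from offset 0 to its end, i.e. the whole file.
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            return f.read()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

With a helper like this, the check in the test would be
struct.unpack("2i", locked_read("data")) rather than an unlocked 'od -i data'.
The writers would similarly need to hold their locks around every write.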

NeilBrown