RE: client caching and locks

Hi,

Sorry to resurrect this old thread, but I'm wondering how NFS clients should behave in the following situation.

I'm seeing this behavior when I run a test program that uses Open MPI. In the test program,
two clients each acquire a lock on their own write location, then simultaneously write
data to the same file over NFS.

In other words, the program does exactly what Bruce described previously:

> > > > >         client 0                        client 1
> > > > >         --------                        --------
> > > > >         take write lock on byte 0
> > > > >                                         take write lock on byte 1
> > > > >         write 1 to offset 0
> > > > >           change attribute now x+1
> > > > >                                         write 1 to offset 1
> > > > >                                           change attribute now x+2
> > > > >         getattr returns x+2
> > > > >                                         getattr returns x+2
> > > > >         unlock
> > > > >                                         unlock
> > > > >
> > > > >         take readlock on byte 1

In my test:
- The file data is zero-filled before the writes.
- client 0 acquires a write lock at offset 0 and writes 1 there.
- client 1 acquires a write lock at offset 4 and writes 2 there.
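
Stripped of the MPI harness, each client effectively does the following. This is a
minimal sketch I wrote for illustration, not the actual test program; the file name,
the rank argument, and the omitted error checking are my assumptions:
----------
/* Illustrative per-client reproducer (not the actual Open MPI test).
 * Assumptions: "data" is the shared file on the NFS mount; rank is
 * 0 on client 0 and 1 on client 1. Error checking is omitted. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int rank = atoi(argv[1]);	/* 0 or 1 */
	int value = rank + 1;		/* client 0 writes 1, client 1 writes 2 */
	int fd = open("data", O_RDWR);
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = rank * 4,	/* offset 0 or 4 */
		.l_len    = 4,
	};

	fcntl(fd, F_SETLKW, &fl);			/* LOCK */
	pwrite(fd, &value, sizeof(value), fl.l_start);	/* WRITE */
	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);			/* LOCKU */
	close(fd);					/* CLOSE */
	return 0;
}
----------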

After the test, I sometimes see the following result, where one client doesn't reflect the other's update:
-----
- client 0:
[user@client0 nfs]$ od -i data
0000000           1           2
0000010

- client 1:
[user@client1 nfs]# od -i data
0000000           0           2
0000010

- NFS server:
[user@server nfs]# od -i data
0000000           1           2
0000010
-----

This happens because client 1 receives its GETATTR reply only after both clients' writes
have completed, so the change attribute it sees already accounts for both writes. Client 1
therefore assumes the file data is unchanged since its last write.
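
In other words, the client's decision reduces to a comparison like the one below. This is
an illustrative sketch with made-up names, not the actual fs/nfs code:
----------
/* Illustrative sketch (hypothetical names, not the actual fs/nfs code). */
static int data_cache_is_stale(unsigned long long cached_change,
			       unsigned long long server_change)
{
	/* cached_change is the changeid the client recorded from its own
	 * WRITE reply. Client 1 recorded x+2 there, so a later GETATTR
	 * also returning x+2 looks like "nothing changed". */
	return server_change != cached_change;
}
----------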

For the details, please see the following analysis of a tcpdump capture collected on the NFS server:
-----
IP addresses are as follows:
- client 0: 192.168.122.158
- client 1: 192.168.122.244
- server: 192.168.122.12

1. clients 0 and 1 called OPEN, LOCK, and WRITE to write their values to their respective offsets in the file simultaneously:

165587 2021-12-27 19:08:26.792438 192.168.122.244 → 192.168.122.12 NFS 354 V4 Call OPEN DH: 0xc1b3a552/data
165589 2021-12-27 19:08:26.801025 192.168.122.12 → 192.168.122.244 NFS 430 V4 Reply (Call In 165587) OPEN StateID: 0x9357
165592 2021-12-27 19:08:26.802125 192.168.122.158 → 192.168.122.12 NFS 322 V4 Call OPEN DH: 0xc1b3a552/data
165593 2021-12-27 19:08:26.802367 192.168.122.12 → 192.168.122.158 NFS 438 V4 Reply (Call In 165592) OPEN StateID: 0xde4c
165595 2021-12-27 19:08:26.807853 192.168.122.158 → 192.168.122.12 NFS 326 V4 Call LOCK FH: 0x4cdb3daa Offset: 0 Length: 4
165596 2021-12-27 19:08:26.807879 192.168.122.244 → 192.168.122.12 NFS 326 V4 Call LOCK FH: 0x4cdb3daa Offset: 4 Length: 4
165597 2021-12-27 19:08:26.807983 192.168.122.12 → 192.168.122.158 NFS 182 V4 Reply (Call In 165595) LOCK
165598 2021-12-27 19:08:26.808047 192.168.122.12 → 192.168.122.244 NFS 182 V4 Reply (Call In 165596) LOCK
165600 2021-12-27 19:08:26.808473 192.168.122.158 → 192.168.122.12 NFS 294 V4 Call WRITE StateID: 0x8cc0 Offset: 0 Len: 4
165602 2021-12-27 19:08:26.809058 192.168.122.244 → 192.168.122.12 NFS 294 V4 Call WRITE StateID: 0x8a41 Offset: 4 Len: 4

2. client 0 received its WRITE reply earlier than client 1 did, so it called LOCKU and CLOSE.

165607 2021-12-27 19:08:26.843312 192.168.122.12 → 192.168.122.158 NFS 334 V4 Reply (Call In 165600) WRITE
165608 2021-12-27 19:08:26.844218 192.168.122.158 → 192.168.122.12 NFS 282 V4 Call LOCKU FH: 0x4cdb3daa Offset: 0 Length: 4
165609 2021-12-27 19:08:26.844320 192.168.122.12 → 192.168.122.158 NFS 182 V4 Reply (Call In 165608) LOCKU
165611 2021-12-27 19:08:26.845007 192.168.122.158 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0xde4c
165613 2021-12-27 19:08:26.845230 192.168.122.12 → 192.168.122.158 NFS 334 V4 Reply (Call In 165611) CLOSE

  At frame 165607, the file's changeid was 1761580521582393257.

    # tshark -r repro.cap -V "frame.number==165607" | grep -e "changeid:" -e Ops
    Network File System, Ops(4): SEQUENCE PUTFH WRITE GETATTR
                        changeid: 1761580521582393257

3. client 0 called OPEN again while client 1 was still waiting for its WRITE reply.

165615 2021-12-27 19:08:26.845652 192.168.122.158 → 192.168.122.12 NFS 322 V4 Call OPEN DH: 0x4cdb3daa/
165616 2021-12-27 19:08:26.847702 192.168.122.12 → 192.168.122.158 NFS 386 V4 Reply (Call In 165615) OPEN StateID: 0x95d6

  At frame 165616, the file's changeid had been incremented to 1761580521582393258; the server had already processed client 1's WRITE, even though its reply (frame 165622) arrived later.

    # tshark -r repro.cap -V "frame.number==165616" | grep -e "changeid:" -e Ops
    Network File System, Ops(5): SEQUENCE PUTFH OPEN ACCESS GETATTR
                        changeid: 1761580521582393258

  Therefore, client 0 called READ and picked up the other client's update.

165617 2021-12-27 19:08:26.848454 192.168.122.158 → 192.168.122.12 NFS 270 V4 Call READ StateID: 0x907b Offset: 0 Len: 8
165618 2021-12-27 19:08:26.848572 192.168.122.12 → 192.168.122.158 NFS 182 V4 Reply (Call In 165617) READ
165619 2021-12-27 19:08:26.849096 192.168.122.158 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0x95d6
165620 2021-12-27 19:08:26.849179 192.168.122.12 → 192.168.122.158 NFS 334 V4 Reply (Call In 165619) CLOSE

4. client 1 finally received its WRITE reply and called LOCKU and CLOSE.

165622 2021-12-27 19:08:26.855130 192.168.122.12 → 192.168.122.244 NFS 334 V4 Reply (Call In 165602) WRITE
165623 2021-12-27 19:08:26.855965 192.168.122.244 → 192.168.122.12 NFS 282 V4 Call LOCKU FH: 0x4cdb3daa Offset: 4 Length: 4
165625 2021-12-27 19:08:26.856094 192.168.122.12 → 192.168.122.244 NFS 182 V4 Reply (Call In 165623) LOCKU
165627 2021-12-27 19:08:26.856647 192.168.122.244 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0x9357
165629 2021-12-27 19:08:26.856784 192.168.122.12 → 192.168.122.244 NFS 334 V4 Reply (Call In 165627) CLOSE

  At frame 165622, the changeid was _not_ incremented any further: the GETATTR in client 1's WRITE reply ran after both writes, so it returned the value that already covered them.

    # tshark -r repro.cap -V "frame.number==165622" | grep -e "changeid:" -e Ops
    Network File System, Ops(4): SEQUENCE PUTFH WRITE GETATTR
                        changeid: 1761580521582393258

5. client 1 called OPEN again but ...

165635 2021-12-27 19:08:26.858006 192.168.122.244 → 192.168.122.12 NFS 322 V4 Call OPEN DH: 0x4cdb3daa/
165636 2021-12-27 19:08:26.859538 192.168.122.12 → 192.168.122.244 NFS 386 V4 Reply (Call In 165635) OPEN StateID: 0x3d15

  ... since no further changes were made to the file, the changeid wasn't updated. 

    # tshark -r repro.cap -V "frame.number==165636" | grep -e "changeid:" -e Ops
    Network File System, Ops(5): SEQUENCE PUTFH OPEN ACCESS GETATTR
                        changeid: 1761580521582393258

6. client 1 assumed the file data was unchanged since its last write. It called CLOSE without calling READ.

165637 2021-12-27 19:08:26.860201 192.168.122.244 → 192.168.122.12 NFS 278 V4 Call CLOSE StateID: 0x3d15
165638 2021-12-27 19:08:26.860296 192.168.122.12 → 192.168.122.244 NFS 334 V4 Reply (Call In 165637) CLOSE
-----

So the current NFS client implementation doesn't guarantee that a client sees another
client's updates when the clients write simultaneously to non-overlapping ranges. Since this
isn't guaranteed by RFC 7530 either, I think this is not a bug.

However, if this were to be implemented, what approaches would be feasible?

If a client invalidates the file's whole cache on unlock, each client sees the other's
updates to non-overlapping ranges. I verified this with the following change to
do_unlk() in fs/nfs/file.c:
----------
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -746,6 +746,14 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct iov_iter *from)
                status = NFS_PROTO(inode)->lock(filp, cmd, fl);
        else
                status = locks_lock_file_wait(filp, fl);
+
+       nfs_sync_mapping(filp->f_mapping);
+       if (!NFS_PROTO(inode)->have_delegation(inode, FMODE_READ)) {
+               nfs_zap_caches(inode);
+               if (mapping_mapped(filp->f_mapping))
+                       nfs_revalidate_mapping(inode, filp->f_mapping);
+       }
+
        return status;
 }
----------
But I feel this approach is redundant, since the client already invalidates its cache when taking a lock, in do_setlk() in fs/nfs/file.c.
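
For reference, the post-lock invalidation in do_setlk() looks roughly like this (based on
my reading of fs/nfs/file.c; exact details may vary by kernel version):
----------
	/* In do_setlk(), after the lock is granted: flush our own dirty
	 * pages, then drop cached data unless we hold a delegation. */
	nfs_sync_mapping(filp->f_mapping);
	if (!NFS_PROTO(inode)->have_delegation(inode, FMODE_READ)) {
		nfs_zap_caches(inode);
		if (mapping_mapped(filp->f_mapping))
			nfs_revalidate_mapping(inode, filp->f_mapping);
	}
----------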
Also, the change above may cause a performance degradation. Are there any other approaches
we could take? Or does NFS not need to implement this at all, given that it's not a bug?

Yuki Inoguchi

> -----Original Message-----
> From: bfields@xxxxxxxxxxxx <bfields@xxxxxxxxxxxx>
> Sent: Wednesday, October 7, 2020 2:26 AM
> To: Matt Benjamin <mbenjami@xxxxxxxxxx>
> Cc: Trond Myklebust <trondmy@xxxxxxxxxxxxxxx>; Inoguchi, Yuki/井ノ口
> 雄生 <inoguchi.yuki@xxxxxxxxxxx>; linux-nfs@xxxxxxxxxxxxxxx
> Subject: Re: client caching and locks
> 
> On Thu, Oct 01, 2020 at 06:26:25PM -0400, Matt Benjamin wrote:
> > I'm not sure.  My understanding has been that, NFSv4 does not mandate
> > a mechanism to update clients of changes outside of any locked range.
> > In AFS (and I think DCE DFS?) this role is played by DataVersion.  If
> > I recall correctly, David Noveck provided an errata that addresses
> > this, that servers could use in a similar manner to DV, but I don't
> > recall the details.
> 
> Maybe you're thinking of the change_attr_type that's new to 4.2?  I
> think that was Trond's proposal originally.  In the
> CHANGE_TYPE_IS_VERSION_COUNTER case it would in theory allow you to
> tell
> whether a file that you'd written to was also written to by someone else
> by counting WRITE operations.
> 
> But we still have to ensure consistency whether the server implements
> that.  (I doubt any server currently does.)
> 
> --b.
> 
> >
> > Matt
> >
> > On Thu, Oct 1, 2020 at 5:48 PM bfields@xxxxxxxxxxxx
> > <bfields@xxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 22, 2020 at 09:52:22AM -0400, bfields@xxxxxxxxxxxx wrote:
> > > > On Thu, Jun 18, 2020 at 04:09:05PM -0400, bfields@xxxxxxxxxxxx wrote:
> > > > > I probably don't understand the algorithm (in particular, how it
> > > > > revalidates caches after a write).
> > > > >
> > > > > How does it avoid a race like this?:
> > > > >
> > > > > Start with a file whose data is all 0's and change attribute x:
> > > > >
> > > > >         client 0                        client 1
> > > > >         --------                        --------
> > > > >         take write lock on byte 0
> > > > >                                         take write lock on byte 1
> > > > >         write 1 to offset 0
> > > > >           change attribute now x+1
> > > > >                                         write 1 to offset 1
> > > > >                                           change attribute now x+2
> > > > >         getattr returns x+2
> > > > >                                         getattr returns x+2
> > > > >         unlock
> > > > >                                         unlock
> > > > >
> > > > >         take readlock on byte 1
> > > > >
> > > > > At this point a getattr will return change attribute x+2, the same as
> > > > > was returned after client 0's write.  Does that mean client 0 assumes
> > > > > the file data is unchanged since its last write?
> > > >
> > > > Basically: write-locking less than the whole range doesn't prevent
> > > > concurrent writes outside that range.  And the change attribute gives us
> > > > no way to identify whether concurrent writes have happened.  (At least,
> > > > not without NFS4_CHANGE_TYPE_IS_VERSION_COUNTER.)
> > > >
> > > > So as far as I can tell, a client implementation has no reliable way to
> > > > revalidate its cache outside the write-locked area--instead it needs to
> > > > just throw out that part of the cache.
> > >
> > > Does my description of that race make sense?
> > >
> > > --b.
> > >
> >
> >
> > --
> >
> > Matt Benjamin
> > Red Hat, Inc.
> > 315 West Huron Street, Suite 140A
> > Ann Arbor, Michigan 48103
> >
> > http://www.redhat.com/en/technologies/storage
> >
> > tel.  734-821-5101
> > fax.  734-769-8938
> > cel.  734-216-5309



