Re: Adventures in NFS re-exporting

----- On 12 Nov, 2020, at 20:55, bfields bfields@xxxxxxxxxxxx wrote:
> On Thu, Nov 12, 2020 at 06:33:45PM +0000, Daire Byrne wrote:
>> There was some discussion about NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
>> allowing for the hack/optimisation but I guess that is only for the
>> case when re-exporting NFSv4 to the eventual clients. It would not
>> help if you were re-exporting an NFSv3 server with NFSv3 to the
>> clients? I lack the deeper understanding to say anything more than
>> that.
> 
> Oh, right, thanks for the reminder.  The CHANGE_TYPE_IS_MONOTONIC_INCR
> optimization still looks doable to me.
> 
> How does that help, anyway?  I guess it avoids false positives of some
> kind when rpc's are processed out of order?
> 
> Looking back at
> 
>	https://lore.kernel.org/linux-nfs/1155061727.42788071.1600777874179.JavaMail.zimbra@xxxxxxxx/
> 
> this bothers me: "I'm not exactly sure why, but the iversion of the
> inode gets changed locally (due to atime modification?) most likely via
> invocation of method inode_inc_iversion_raw. Each time it gets
> incremented the following call to validate attributes detects changes
> causing it to be reloaded from the originating server."
> 
> The only call to that function outside afs or ceph code is in
> fs/nfs/write.c, in the write delegation case.  The Linux server doesn't
> support write delegations, Netapp does but this shouldn't be causing
> cache invalidations.

So, I can't lay claim to identifying the exact optimisation/hack that improves the retention of the re-export server's client cache when re-exporting an NFSv3 server (which is then read by many clients). We were working with an engineer at the time who took an interest in our use case, and after we supplied a reproducer he suggested modifying this check in fs/nfs/inode.c:

-		if (!inode_eq_iversion_raw(inode, fattr->change_attr)) {
+		if (inode_peek_iversion_raw(inode) < fattr->change_attr) {
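
For context (this is just my paraphrasing of the stock helpers in include/linux/iversion.h, not part of his patch), the raw iversion helpers are thin wrappers around i_version, so the change amounts to relaxing "any difference invalidates" into "only a newer server value invalidates":

/* Paraphrased from include/linux/iversion.h */
static inline u64 inode_peek_iversion_raw(const struct inode *inode)
{
        /* read i_version exactly as stored, no flag handling */
        return atomic64_read(&inode->i_version);
}

static inline bool inode_eq_iversion_raw(const struct inode *inode, u64 old)
{
        return inode_peek_iversion_raw(inode) == old;
}

/*
 * Stock check in fs/nfs/inode.c: any mismatch between the cached
 * i_version and fattr->change_attr invalidates the inode's data,
 * even if i_version was only perturbed locally.
 *
 * Patched check: only invalidate when the server's change attribute
 * has moved past the cached value, i.e. treat it as a monotonically
 * increasing counter.
 */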

His reasoning at the time was:

"Fixes inode invalidation caused by read access. The least
 important bit is ORed with 1 and causes the inode version to differ from the
 one seen on the NFS share. This in turn causes unnecessary re-download
 impacting the performance significantly. This fix makes it only re-fetch file
 content if inode version seen on the server is newer than the one on the
 client."

But I've always been puzzled by why this only seems to happen when using knfsd to re-export the (NFSv3) client mount. Running multiple processes directly against a standard client mount never causes any similar re-validations. And this happens with a completely read-only share, which is why I started to wonder whether it has something to do with atimes: could an atime update still count as a "write" modification even when the share is read-only?
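
Another possibility I've wondered about, though this is pure speculation on my part rather than anything we actually traced: the generic iversion code reserves the least significant bit of i_version as a "this value has been queried" flag, while the NFS client stores the server's change attribute via the _raw helpers, i.e. unshifted. If anything on the re-export server queried the change attribute through the non-raw helpers, that would OR in the low bit and make the raw equality test fail, which at least matches the "least important bit is ORed with 1" observation above. Roughly (again paraphrased from include/linux/iversion.h, barriers omitted):

#define I_VERSION_QUERIED_SHIFT 1
#define I_VERSION_QUERIED       (1ULL << (I_VERSION_QUERIED_SHIFT - 1))

/*
 * Filesystems using the kernel-managed scheme keep the counter shifted
 * left by one and use bit 0 to record that the value has been queried.
 */
static inline u64 inode_query_iversion(struct inode *inode)
{
        u64 cur, new;

        cur = atomic64_read(&inode->i_version);
        do {
                if (cur & I_VERSION_QUERIED)
                        break;          /* flag already set, nothing to do */
                new = cur | I_VERSION_QUERIED;
        } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));

        return cur >> I_VERSION_QUERIED_SHIFT;
}

/*
 * If the NFS client's raw value (== fattr->change_attr) ever gets ORed
 * with 1 like this, it no longer compares equal to the server's change
 * attribute, but it is still >= it, so the "<" test in the patch above
 * does not invalidate the cache.
 */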

In our case we saw this at its most extreme when we were re-exporting a read-only NFSv3 Netapp "software" share and loading large applications with many python search paths to trawl through. Multiple clients of the re-export server just kept causing the re-export server's client to re-validate and re-download from the Netapp, even though no files or dirs had changed and we had set a large actimeo (with nocto for good measure).

The patch made the re-export server's client cache behave the same whether we ran 100 processes directly on the NFSv3 client mount (on the re-export server) or ran the same workload on 100 clients of the re-export server - the data remained in the client cache for the duration. So the re-export server fetches the data from the originating server once and then serves all those results many times over to all the clients from its cache - exactly what we want.

Daire


