Just out of curiosity: have you tried, instead of re-exporting the NFS mount directly, re-exporting an overlayfs mount on top of the original NFS mount? Such a setup should cover most of your issues.

Regards,
Tigran.

----- Original Message -----
> From: "Daire Byrne" <daire@xxxxxxxx>
> To: "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> Cc: linux-cachefs@xxxxxxxxxx
> Sent: Monday, September 7, 2020 7:31:00 PM
> Subject: Adventures in NFS re-exporting
>
> Hi,
>
> Apologies for this rather long email, but I thought there may be some interest out there in the community in how and why we've been doing something unsupported and barely documented - NFS re-exporting! And I'm not sure I can tell our story well in just a few short sentences, so please bear with me (or stop now!).
>
> Full disclosure - I am also rather hoping that this story piques some interest amongst developers to help make our rather niche setup even better and perhaps a little better documented. I also totally understand if this is something people wouldn't want to touch with a very long barge pole....
>
> First, a quick bit of history (I hope I have this right). Late in 2015, Jeff Layton proposed a patch series allowing knfsd to re-export an NFS client mount. The rationale then was to provide a "proxy" server that could mount an NFSv4-only server and re-export it to older clients that only supported NFSv3. One of the main sticking points then (as now) was the 63-byte limit on NFSv3 filehandles and how it couldn't be guaranteed that all re-exported filehandles would fit within that (in my experience it mostly works with "no_subtree_check"). There are also the usual locking and coherence concerns with NFSv3, but I'll get to that in a bit.
>
> Then, almost two years later, v4.13 was released including parts of the patch series that actually allowed the re-export, and since then other relevant bits (such as the open file cache) have also been merged. I soon became interested in using this new functionality both to accelerate our on-premises NFS storage and to use it as a "WAN cache", giving cloud compute instances locally cached proxy access to our on-premises storage.
>
> Cut to a brief introduction to us and what we do... DNEG is an award-winning VFX company which uses large compute farms to generate complex final frame renders for movies and TV. This workload mostly consists of reads of common data shared between many render clients (e.g. textures, geometry) and a little unique data per frame. All file writes are to unique files per process (frames) and there is very little, if any, writing over existing files. Hence it's not very demanding on locking and coherence guarantees.
>
> When our on-premises NFS storage is being overloaded or the server's network is maxed out, we can place multiple re-export servers in between them and our farm to improve performance. When our on-premises render farm is not quite big enough to meet a deadline, we spin up compute instances with a (reasonably local) cloud provider. Some of these cloud instances are Linux NFS servers which mount our on-premises NFS storage servers (~10ms away) and re-export these to the other cloud (render) instances. Since we know that the data we are reading doesn't change often, we can increase the actimeo and even use nocto to reduce the network chatter back to the on-prem servers.
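For illustration, the client mount and re-export described above might look roughly like this on one of the re-export servers; the hostname, export path, subnet and fsid below are made up, and the option values are only examples (actimeo=3600 and nocto being the ones referred to above):

    # Mount the on-prem server with long attribute caching and without
    # close-to-open consistency, since the working set rarely changes:
    mount -t nfs -o vers=4.2,actimeo=3600,nocto onprem-nfs:/vol/projects /srv/projects

    # /etc/exports on the re-export server: an explicit fsid= is needed when
    # re-exporting an NFS client mount, and no_subtree_check helps keep the
    # re-exported filehandles small:
    /srv/projects  10.0.0.0/16(rw,no_subtree_check,fsid=1234)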
>
> These re-export servers also use fscache/cachefiles to cache data to disk so that we can retain TBs of previously read data locally in the cloud over long periods of time. We also use NFSv4 (less network chatter) all the way from our on-prem storage to the re-export server and then on to the clients.
>
> The re-export server(s) quickly builds up both a memory cache and a disk-backed fscache/cachefiles storage cache of our working data set, so the data being pulled from on-prem lessens over time. Data is only ever read once over the WAN network from on-prem storage and then read multiple times by the many render client instances in the cloud. Recent NFS features such as "nconnect" help to speed up the initial reading of data from on-prem by using multiple connections to offset TCP latency. At the end of the render, we write the files back through the re-export server to our on-prem storage. Our average read bandwidth is many times higher than our write bandwidth.
>
> Rather surprisingly, this mostly works for our particular workloads. We've completed movies using this setup and saved money on commercial caching systems (e.g. Avere, GPFS, etc). But there are still some remaining issues with doing something that is very much not widely supported (or recommended). In most cases we have worked around them, but it would be great if we didn't have to, so others could also benefit. I will list the main problems quickly now and provide more information and reproducers later if anyone is interested.
>
> 1) The kernel can drop entries out of the NFS client inode cache (under memory cache churn) when those filehandles are still being used by knfsd's remote clients, resulting in sporadic and random stale filehandles. This seems to be mostly for directories from what I've seen. Does the NFS client not know that knfsd is still using those files/dirs? The workaround is to never drop inode & dentry caches on the re-export servers (vfs_cache_pressure=1). This also helps to ensure that we actually make the most of our actimeo=3600,nocto mount options for the full specified time.
>
> 2) If we cache metadata on the re-export server using actimeo=3600,nocto, we can cut the network packets back to the origin server to zero for repeated lookups. However, if a client of the re-export server walks paths and memory maps those files (i.e. loading an application), the re-export server starts issuing unexpected calls back to the origin server again, ignoring/invalidating the re-export server's NFS client cache. We worked around this by patching an inode/iversion validity check in inode.c so that the NFS client cache on the re-export server is used. I'm not sure about the correctness of this patch but it works for our corner case.
>
> 3) If we saturate an NFS client's network with reads from the server, all client metadata lookups become unbearably slow even if it's all cached in the NFS client's memory and no network RPCs should be required. This is the case for any NFS client regardless of re-exporting, but it affects this case more because when we can't serve cached metadata we also can't serve the cached data. It feels like some sort of bottleneck in the client's ability to parallelise requests? We work around this by not maxing out our network.
>
> 4) With an NFSv4 re-export, lots of open/close requests (hundreds per second) quickly eat up the CPU on the re-export server, and perf top shows we are mostly in native_queued_spin_lock_slowpath. Does NFSv4 also need an open file cache like the one added for NFSv3? Our workaround is to either fix the thing doing lots of repeated open/closes or use NFSv3 instead.
>
> If you made it this far, I've probably taken up way too much of your valuable time already. If nobody is interested in this rather niche application of the Linux client & knfsd, then I totally understand and I will not mention it here again. If your interest is piqued however, I'm happy to go into more detail about any of this with the hope that this could become a better documented and understood type of setup that others with similar workloads could reference.
>
> Also, many thanks to all the Linux NFS developers for the amazing work you do which, in turn, helps us to make great movies. :)
>
> Daire (Head of Systems DNEG)
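Purely as a rough, untested sketch of the caching pieces described above (the fscache/cachefiles disk cache, "nconnect", and the vfs_cache_pressure workaround from point 1), something along these lines - the paths, the nconnect value and the culling thresholds are all made up:

    # Client mount on the re-export server with a local fscache (-o fsc,
    # needs cachefilesd running) and several TCP connections to on-prem:
    mount -t nfs -o vers=4.2,actimeo=3600,nocto,fsc,nconnect=16 onprem-nfs:/vol/projects /srv/projects

    # /etc/cachefilesd.conf - point the cache at a large local disk:
    dir /var/cache/fscache
    tag nfs-reexport
    brun  10%
    bcull  7%
    bstop  3%

    # Workaround from (1): stop the VM from reclaiming inodes/dentries that
    # knfsd's clients may still hold filehandles for:
    sysctl vm.vfs_cache_pressure=1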
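And, for reference, a minimal sketch of the overlayfs suggestion at the top of this mail - re-exporting an overlay whose lower layer is the NFS client mount, rather than exporting the NFS mount itself. Again the paths and fsid are made up, and NFS-exporting an overlay needs a reasonably recent kernel plus the nfs_export=on,index=on mount options:

    # Lower layer: the (read-mostly) NFS client mount of the on-prem server
    mount -t nfs -o vers=4.2,actimeo=3600,nocto onprem-nfs:/vol/projects /mnt/lower

    # Upper and work dirs must be on a local filesystem (e.g. xfs/ext4)
    mkdir -p /data/upper /data/work /srv/projects
    mount -t overlay overlay \
        -o lowerdir=/mnt/lower,upperdir=/data/upper,workdir=/data/work,nfs_export=on,index=on \
        /srv/projects

    # Export the overlay mount point instead of the NFS mount (/etc/exports):
    /srv/projects  10.0.0.0/16(rw,no_subtree_check,fsid=1235)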