On Fri, Jul 05, 2024 at 09:25:56AM +1000, Dave Chinner wrote:
> On Thu, Jul 04, 2024 at 07:00:23PM +0000, Chuck Lever III wrote:
> >
> >
> > > On Jul 3, 2024, at 6:24 PM, NeilBrown <neilb@xxxxxxx> wrote:
> > >
> > >
> > > I've been pondering security questions with localio - particularly
> > > wondering what questions I need to ask. I've found three focal points
> > > which overlap but help me organise my thoughts:
> > > 1- the LOCALIO RPC protocol
> > > 2- the 'auth_domain' that nfsd uses to authorise access
> > > 3- the credential that is used to access the file
> > >
> > > 1/ It occurs to me that I could find out the UUID reported by a given
> > > local server (just ask it over the RPC connection), find out the
> > > filehandle for some file that I don't have write access to (not too
> > > hard), and create a private NFS server (hacking nfs-ganasha?) which
> > > reports the same uuid and reports that I have access to a file with
> > > that filehandle. If I then mount from that server inside a private
> > > container on the same host that is running the local server, I would get
> > > localio access to the target file.
>
> This seems amazingly complex for something that is actually really
> simple.

Could be completely wrong, but I'm inferring you've read more linux-nfs
email (particularly about alternative directions for implementation)
than looked at the localio code. But more below.

> Keep in mind that I am speaking from having direct
> experience with developing and maintaining NFS client IO bypass
> infrastructure from when I worked at SGI as an NFS engineer.

Thanks for sharing all this about IRIX, really helpful.

> So, let's look at the Irix NFS client/server and the "Bulk Data
> Service" protocol extensions that SGI wrote for NFSv3 back in the
> mid 1990s.
> Here's an overview from the 1996 product documentation
> "Getting Started with BDSpro":
>
> https://irix7.com/techpubs/007-3274-001.pdf
>
> At least read chapter 1 so you grok the fundamentals of how the IO
> bypass worked. It should look familiar, because it isn't very
> different to how NFS over RDMA or client side IO for pNFS works.
>
> Essentially, The NFS client transparently sent all the data IO (read
> and write) over a separate communications channel for any IO that
> met the size and alignment constraints. This was effectively a
> "remote-IO" bypass that streamed data rather than packetised it
> (NFS_READ/NFS_WRITE is packetised data with RTT latency issues).
> By getting rid of the round trip latency penalty, data could be
> sent/recieved at full network throughput rates.
>
> [ As an aside, the BDS side channel was also the mechanism that used
> by SGI for NFS over RDMA with custom full stack network offload
> hardware back in the mid 1990s. NFS w/ BDS ran at about 800MB/s on
> those networks on machines with 200MHz CPUs (think MIPS r10k). ]
>
> The client side userspace has no idea this low level protocol
> hijacking occurs, and it doesn't need to because all it changes
> is the read/write IO speed. The NFS protocol is still used for all
> authorisation, access checks, metadata operations, etc, and all that
> changes is how NFS_READ and NFS_WRITE operations are performed.
>
> The local-io stuff is no different - we're just using a different
> client side IO path in kernel. We don't need a new protocol, nor do
> we need userspace to be involved *at all*. The kernel NFS client
> can easily discover that it is on the same host as the server. The
> server already does this "client is on the same host", so both will
> then know they can *transparently* enable the localio bypass without
> involving userspace at all.
>
> The NFS protocol still provides all the auth, creds, etc to allow
> the NFS client read and write access to the file.
> The NFS server
> provides the client with a filehandle build by the underlying
> filesystem for the file the NFS client has been permission to
> access.
>
> The local filesystem will accept that filehandle from any kernel
> side context via the export ops for that filesystem. This provides
> a mechanism for the NFS client to convert that to a dentry
> and so open the file directly from the file handle. This is what the
> server already does, so it should be able to share the filehandle
> decode and open code from the server, maybe even just reach into the
> server export table directly....
>
> IOWs, we don't need to care about whether the mount is visible to
> the NFS client - the filesystem *export* is visible to the *kernel*
> and the export ops allow unfettered filehandle decoding. Containers
> are irrelevant - the server has granted access to the file, and so
> the NFS client has effective permissions to resolve the filehandle
> directly..

IRIX sounds well engineered.

The Linux NFS code's interfaces aren't so clean/precise. The data
structures are pretty tightly coupled (one big struct nfs_client,
nfs_server, nfsd_net, etc. sunrpc's svc_rqst in particular carries
auth info that is used as part of the wire protocol business end
-- so NFS auth is in the layer localio wants to bypass).

As such, the interfaces tend to do a lot of different tasks on behalf
of structures that carry the kitchen sink. So retrofitting the Linux
NFS and RPC code to allow a subset of the NFS client and server code
to be used isn't so clean.

But what you described IRIX did is pretty much what my localio series
provides, see:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=nfs-localio-for-next

(Always room for improvement, like I said to Christoph, especially on
the IO submission and handling side.. as you've seen it is doing
buffered IO and is synchronous.. really leaving performance lackluster
but lots of upside to be had making it async and supporting DIO).
> Fundamentally, this is the same permission and access model that
> pNFS is built on. Hence I don't understand why this local-io bypass
> needs something completely new and seemingly very complex...

pNFS doesn't need to have a direct role in any of this localio code,
but pNFS can use localio if it is enabled, e.g.:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=5e7bf77fbbecdbea0e6ae7174c97a69b11e3098a

Localio not being tightly coupled to pNFS enables localio to easily
support NFSv3. NFSv3 support is a requirement and there is no reason
not to support it.

Anyway, I really think Neil's ideas for localio improvement are solid.
Especially factoring out the auth_domain to ensure bog standard NFS
authentication and security mechanisms are used. IMO the proposed
localio protocol changes aren't _really_ needed, but I also won't
fight to stop localio nfsd UUID sharing being more ephemeral and
risk-averse...

The only reason there is a sideband/auxiliary "localio protocol" is
that the NFS protocol is very focused on enabling NFS spec
implementation. I actually framed it in terms of NFS encode and decode
on the server side and Chuck wanted me to make sure to decouple localio
so that it stood on its own (I agree with him, I just didn't think to
do it that way). I just needed a means to generate and get a UUID from
the server to anchor the mechanism for nfs_common to allow the client
and server to rendezvous, see:

nfs_common:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=cb542e791eda114adcc9291feb6c66a5ea338f7c
nfs server:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=877a8212c3af37b5ba32959275f4c49bfe805f24
nfs client:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=572b36de2bb1dde06d6da4488686c9fbbc79d7e1

Really quite simple.
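
To make the UUID rendezvous idea concrete, here is a toy userspace
model in Python. It is only a sketch of the concept, not the kernel
code: LocalRegistry, register(), lookup() and client_probe() are
hypothetical names invented for illustration, and the dictionary merely
stands in for whatever host-wide state nfs_common keeps.

```python
import uuid


class LocalRegistry:
    """Hypothetical stand-in for per-host state shared by client and server."""

    def __init__(self):
        self._servers = {}

    def register(self, server_name):
        # The server generates a UUID and records it in host-local state.
        server_uuid = uuid.uuid4()
        self._servers[server_uuid] = server_name
        return server_uuid

    def unregister(self, server_uuid):
        # Drop the UUID, e.g. when the server shuts down.
        self._servers.pop(server_uuid, None)

    def lookup(self, server_uuid):
        # Return the registered server name, or None if unknown on this host.
        return self._servers.get(server_uuid)


def client_probe(registry, reported_uuid):
    """True if the UUID the server reported is registered on this host."""
    return registry.lookup(reported_uuid) is not None


if __name__ == "__main__":
    host_a = LocalRegistry()
    host_b = LocalRegistry()
    reported = host_a.register("nfsd")
    print(client_probe(host_a, reported))  # client on the same host
    print(client_probe(host_b, reported))  # client on a different host
```

Note that this toy check only proves that the reported UUID is
registered on the host. It says nothing about who reported it, which
is the scenario Neil raises in his point 1/.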
And the pgio hooks used to branch to localio handling of READ, WRITE
and COMMIT are the interface point for then generating a kiocb and
issuing the IO accordingly (last 2 commits below):
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=46569e6d92a074188bb1f0090d36c327729ab418

But even the buffered and direct IO in nfsd are really tightly coupled
to the wire protocol interface. So localio hooks pgio and calls down
to the underlying filesystem with its own side channel (that uses
.read_iter and .write_iter), see fs/nfs/localio.c
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=877a8212c3af37b5ba32959275f4c49bfe805f24
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=nfs-localio-for-next&id=4222309dac70e485f089738d0ffe9113b9a5a1e1
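
As a rough illustration of that branch point, the following Python
sketch models a read that either goes through a simulated RPC or
bypasses it and reads the file directly. It is a userspace analogy
under stated assumptions, not how fs/nfs/localio.c is written:
ToyServer, pgio_read() and local_read() are hypothetical names, and
os.pread() merely plays the role that a kiocb plus .read_iter plays in
the kernel.

```python
import os
import tempfile


class ToyServer:
    """Hypothetical server that counts how many READs arrive over the 'wire'."""

    def __init__(self, path):
        self.path = path
        self.rpc_reads = 0

    def rpc_read(self, offset, length):
        # Stand-in for the normal NFS READ path.
        self.rpc_reads += 1
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


def local_read(path, offset, length):
    # Stand-in for the localio side channel: read the backing file directly.
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


def pgio_read(server, offset, length, localio_enabled):
    # The branch point: bypass the RPC only when localio is enabled.
    if localio_enabled:
        return local_read(server.path, offset, length)
    return server.rpc_read(offset, length)


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello localio")
        path = f.name
    try:
        server = ToyServer(path)
        via_wire = pgio_read(server, 6, 7, localio_enabled=False)
        via_local = pgio_read(server, 6, 7, localio_enabled=True)
        return via_wire, via_local, server.rpc_reads
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Both paths return the same bytes, but only the non-localio read is
counted by the toy server. That is the point of the hook: the caller
above pgio does not change, only the way the IO is carried out.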