On Tue, Dec 10, 2024 at 05:33:14PM +0000, Trond Myklebust wrote: > On Tue, 2024-12-10 at 12:12 -0500, Nikhil Jha wrote: > > On Tue, Dec 10, 2024 at 07:11:43AM -0500, Benjamin Coddington wrote: > > > On 9 Dec 2024, at 12:39, Nikhil Jha wrote: > > > > > > > Hello! This is the first kernel patch I have tried to upstream. > > > > I'm > > > > following along with the kernel newbies guide but apologies if I > > > > got > > > > anything wrong. > > > > > > > > Currently, if there is a mismatch in the request and response > > > > fileids in > > > > an NFS request, the kernel logs an error and attempts to return > > > > ESTALE. > > > > However, this error is currently dropped before it makes it all > > > > the way > > > > to userspace. This appears to be a mistake, since as far as I can > > > > tell > > > > that ESTALE value is never consumed from anywhere. > > > > > > > > Callstack for async NFS write, at time of error: > > > > > > > > nfs_update_inode <- returns -ESTALE > > > > nfs_refresh_inode_locked > > > > nfs_writeback_update_inode <- error is dropped here > > > > nfs3_write_done > > > > nfs_writeback_done > > > > nfs_pgio_result <- other errors are collected here > > > > rpc_exit_task > > > > __rpc_execute > > > > rpc_async_schedule > > > > process_one_work > > > > worker_thread > > > > kthread > > > > ret_from_fork > > > > > > > > We ran into this issue ourselves, and seeing the -ESTALE in the > > > > kernel > > > > source code but not from userspace was surprising. > > > > > > > > I tested a rebased version of this patch on an el8 kernel > > > > (v6.1.114), > > > > and it seems to correctly propagate this error. > > > > > > > > > 8------------------------------------------------------8< > > > > > > > > If an NFS server returns a response with a different file id to > > > > the > > > > response, the kernel currently prints out an error and attempts > > > > to > > > > return -ESTALE. However, this -ESTALE value is never surfaced > > > > anywhere. > > > > > > Hi Nikhil Jha, > > > > > > Will this cause us to return -ESTALE to the application even if the > > > WRITE > > > was successful? > > > > > > Ben > > > > > > > Hi Ben, > > > > Hmm.. I'm not sure how to answer that question exactly. A fileid is > > only > > mismatched between a request RPC and response RPC when something > > really > > weird is going on (i.e. a bug in the NFS server), so it's hard to > > reason > > about if a WRITE was successful or not. > > > > The `return -ESTALE` was already in the kernel code, but this > > particular > > codepath seems to have accidentally dropped that error. > > > > Nikhil > > > > I'm not keen on taking any client changes to paper over what appears to > be a major server bug. We simply can't support broken servers that > reuse filehandles across files. > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx > Right, this patch *doesn't* alleviate any server bug from the client. I don't think it would make sense to try to fix a buggy server from the Linux NFS client. What I'm thinking about is... 1. We're already detecting this problem, printing an error message, and trying to return ESTALE. 2. That ESTALE is not propagated anywhere. It just gets dropped. If we don't want this ESTALE to be visible, should it just get removed entirely? The value does not appear to be consumed from anywhere in the kernel code.