On Tue, 2024-12-10 at 12:12 -0500, Nikhil Jha wrote: > On Tue, Dec 10, 2024 at 07:11:43AM -0500, Benjamin Coddington wrote: > > On 9 Dec 2024, at 12:39, Nikhil Jha wrote: > > > > > Hello! This is the first kernel patch I have tried to upstream. > > > I'm > > > following along with the kernel newbies guide but apologies if I > > > got > > > anything wrong. > > > > > > Currently, if there is a mismatch in the request and response > > > fileids in > > > an NFS request, the kernel logs an error and attempts to return > > > ESTALE. > > > However, this error is currently dropped before it makes it all > > > the way > > > to userspace. This appears to be a mistake, since as far as I can > > > tell > > > that ESTALE value is never consumed from anywhere. > > > > > > Callstack for async NFS write, at time of error: > > > > > > nfs_update_inode <- returns -ESTALE > > > nfs_refresh_inode_locked > > > nfs_writeback_update_inode <- error is dropped here > > > nfs3_write_done > > > nfs_writeback_done > > > nfs_pgio_result <- other errors are collected here > > > rpc_exit_task > > > __rpc_execute > > > rpc_async_schedule > > > process_one_work > > > worker_thread > > > kthread > > > ret_from_fork > > > > > > We ran into this issue ourselves, and seeing the -ESTALE in the > > > kernel > > > source code but not from userspace was surprising. > > > > > > I tested a rebased version of this patch on an el8 kernel > > > (v6.1.114), > > > and it seems to correctly propagate this error. > > > > > > > 8------------------------------------------------------8< > > > > > > If an NFS server returns a response with a different file id to > > > the > > > response, the kernel currently prints out an error and attempts > > > to > > > return -ESTALE. However, this -ESTALE value is never surfaced > > > anywhere. > > > > Hi Nikhil Jha, > > > > Will this cause us to return -ESTALE to the application even if the > > WRITE > > was successful? > > > > Ben > > > > Hi Ben, > > Hmm.. I'm not sure how to answer that question exactly. A fileid is > only > mismatched between a request RPC and response RPC when something > really > weird is going on (i.e. a bug in the NFS server), so it's hard to > reason > about if a WRITE was successful or not. > > The `return -ESTALE` was already in the kernel code, but this > particular > codepath seems to have accidentally dropped that error. > > Nikhil > I'm not keen on taking any client changes to paper over what appears to be a major server bug. We simply can't support broken servers that reuse filehandles across files. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx