hi Bruce,

thanks for the response. this opens up a few questions about things i
thought i understood initially, so i did a re-read of parts of the NFS 4.1
RFC (RFC 5661), and i would like to clarify some things further. see
answers below:

On Wed, Oct 14, 2020 at 10:27 PM J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
>
> On Sat, Oct 10, 2020 at 11:39:30PM +0300, guy keren wrote:
> > during the design, we encountered some issues with high-availability
> > and persistent sessions handling by the linux NFS client, and i
> > would like to understand a few things about the linux NFS client - i
> > read all relevant material on www.linux-nfs.org, and spent a while
> > reading the relevant recovery code in the nfs4.1 client kernel
> > sources, but i am missing some things (a pointer to the relevant
> > part in the recovery code will be appreciated as well):
> >
> >
> > 1. suppose there is a persistent session that got disconnected
> > (because of a server restart, for example). i see that the client is
> > re-sending all the in-flight commands as part of the recovery.
> > however, suppose that one of the commands was a
> > compound command containing 2 requests, and the reply to the first
> > of them was NFS4_OK, and to the 2nd it was NFS4ERR_DELAY - will the
> > client's code know that after it finishes recovery of the session -
> > then when it creates a new session, it needs to re-send the 2nd
> > request in this compound command?
>
> If the client received the reply, it shouldn't have to resend the
> compound at all.
>
> If the client didn't see the reply, it will resend the whole compound.
> Its behavior won't be affected by how the compound failed, since it
> can't know that.

according to what you wrote here, an NFS4ERR_DELAY response is something
that needs to be sent at the level of the entire compound request - i.e.
the server is not allowed to send a compound response where the first few
requests have a status of NFS4_OK, while the last have a status of
NFS4ERR_DELAY.

i tried looking exactly where the spec specifies the possibility of the
server sending an NFS4ERR_DELAY, and one example is on delegation recall.
i am quoting a paragraph from section 10.2 of the spec:

===================
On recall, the client holding the delegation needs to flush modified
state (such as modified data) to the server and return the
delegation. The conflicting request will not be acted on until the
recall is complete. The recall is considered complete when the
client returns the delegation or the server times its wait for the
delegation to be returned and revokes the delegation as a result of
the timeout. In the interim, the server will either delay responding
to conflicting requests or respond to them with NFS4ERR_DELAY.
Following the resolution of the recall, the server has the
information necessary to grant or deny the second client's request.
===========================

according to what you say, if the OPEN request is in the middle of the
compound request, and is preceded by state-modifying requests (e.g.
creation of other files, writes into other open handles, renames, etc.),
then the server must avoid processing them until it has recalled the
delegation to the file (i.e. it must scan the entire compound, before it
starts processing it, to make sure it doesn't need to send an
NFS4ERR_DELAY response due to any of the requests inside it, and it must
also lock the state of all files involved in the request, to avoid another
client acquiring a delegation on any of the files in the request that have
an OPEN request in the same compound). alternatively, it must not send an
NFS4ERR_DELAY response, and instead just keep the request pending until
the delegation recall has completed.

do i understand you correctly here?
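
for reference, a minimal python sketch of the COMPOUND evaluation model as
described in RFC 5661: operations are evaluated in order, evaluation stops
at the first operation that returns an error, and the reply carries the
results of every operation up to and including the failing one. all names
here (Op, process_compound, the example op list) are hypothetical and only
illustrate the model; this is not how any real server is implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

NFS4_OK = 0
NFS4ERR_DELAY = 10008  # status code value from RFC 5661


@dataclass
class Op:
    """Hypothetical stand-in for one operation inside a COMPOUND."""
    name: str
    run: Callable[[], int]  # returns an NFS4 status code


def process_compound(ops: List[Op]) -> Tuple[int, List[Tuple[str, int]]]:
    """Evaluate ops in order and stop at the first non-OK status.

    Returns the overall status (the status of the last op evaluated)
    and the per-op results up to and including the failing op.
    """
    results: List[Tuple[str, int]] = []
    status = NFS4_OK
    for op in ops:
        status = op.run()
        results.append((op.name, status))
        if status != NFS4_OK:
            break  # later ops are never evaluated
    return status, results


executed: List[str] = []


def make_op(name: str, status: int) -> Op:
    """Build an op that records that it ran and returns a fixed status."""
    def run() -> int:
        executed.append(name)
        return status
    return Op(name, run)


# Illustrative compound: the OPEN hits a delegation recall in progress.
ops = [
    make_op("SEQUENCE", NFS4_OK),
    make_op("PUTFH", NFS4_OK),
    make_op("OPEN", NFS4ERR_DELAY),
    make_op("GETATTR", NFS4_OK),
]
status, results = process_compound(ops)
print(status, results)
```

under this model the reply holds NFS4_OK for SEQUENCE and PUTFH, then
NFS4ERR_DELAY for OPEN, and GETATTR is never evaluated.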
>
> > the broader question is about a
> > compound with N commands, where the first X have an NFS4_OK reply
> > and the last N-X have NFS4ERR_DELAY
>
> The server always stops processing a compound at the first failure, so
> N-X is always <=1.

granted.

>
> > - will the client re-send a new
> > compound with the last N-X commands after establishing a new
> > session?
>
> A resend by definition is a resend of exactly the same compound. The
> client won't break it into pieces in that way.
>
> (And typical compounds can't be broken up that way anyway--often earlier
> ops in the compound are things like PUTFH's that supply required
> information to later ops.)

i would assume that the same mechanism used to create the compound request
in the first place (adding the PUTFH in front, etc.) could be used during a
re-building of a smaller compound request - provided that the client knows
which requests from the compound were already completed - and which were
not. but i understand that there's no such mechanism today in the linux NFS
client kernel code - which is what i initially asked - so that clarifies
things.

>
> > 2. if there is a non-persistent session, on which the client sent a
> > non-idempotent request (e.g. rename of a file into a different
> > directory), and the server restarted before the client received the
> > response - will the client just blindly re-send the same request
> > again after establishing a new session, or will it take some
> > measures to attempt to understand whether the command was already
> > executed? i.e. if the server already executed the rename, then
> > re-sending it will return a failure to locate the source file handle
> > (because it moved to a new directory).
>
> In a rename of A/X to B/Y, the source filehandle refers to the directory
> "A", so that filehandle will still work. You might get a NFS4ERR_NOENT
> if there's nothing at A/X any more, and you could guess that meant the
> rename succeeded.
> But it could equally well be that your rename was
> never executed, and it's somebody else's rename or unlink that caused
> A/X to no longer exist. Similarly, the A/X might have executed but
> another operation might have immediately created something else at A/X.

i see. understood.

>
> > does the linux NFS client
> > attempt to recover from this, or will it simply return an error to
> > the application layer?
>
> I suspect that's all any client does. You can imagine all sorts of
> complicated heuristics, but none of them will be 100% right. Persistent
> sessions is what you really need to fix this kind of bug.

what about a situation in which instead of a server restart event, the
client just disconnected before receiving a rename response, and
re-connected with the same session to the same server? in that case, i
presume that the Linux NFS client will re-send the compound request, and
get the results from the server's Duplicate-Request cache, without
returning errors to the application. correct?

>
> > 3. what NFS server with persistent sessions is used (or was used)
> > when testing the persistent sessions support in the linux NFS
> > client? the linux NFS server, as far as i understood, cannot support
> > persistent sessions (due to lack of assured persistent memory).
>
> I don't think any special hardware is necessary. Or if it is, we could
> just disable the feature in the absence of that hardware. Mainly what
> we need is some cooperation from the filesystem--some way the can ID
> particular operations so the server can ask the filesystem if a
> particular operation was committed to disk. I talked to the XFS
> developers about it informally and they seemed open to the idea, but
> they need some sort of explanation of the requirements and I haven't
> gotten around to it....

you might also need the file system to be aware of delegations at some
level, in order to break delegations held by NFS4 clients, when a local
application attempts to open a file in a conflicting manner.

and this doesn't answer the original question: how was the "persistent
sessions" support in the linux NFS 4.1 client tested? when i tried to find
an NFS 4.1 server that supports "persistent sessions" i first went to
NetApp - and doing a "node takeover" operation on it revealed that the
session is unknown on the 2nd node - making it practically irrelevant for
such scenarios (unless there is some way to change the behaviour of this
feature to behave more like SMB3 CA volumes).

>
> --b.

as an aside - i see that you are also the maintainer of the pynfs test
suite. would you be interested in patches fixing its install operation, and
if yes - should we send them to this mailing list, or directly to you? i
failed to find a mailing list dedicated to pynfs development.

thanks,
--guy keren
Vast Data