On Thu, 2020-12-03 at 20:41 -0500, bfields@xxxxxxxxxxxx wrote:
> On Fri, Dec 04, 2020 at 01:02:20AM +0000, Trond Myklebust wrote:
> > On Thu, 2020-12-03 at 18:16 -0500, bfields@xxxxxxxxxxxx wrote:
> > > On Thu, Dec 03, 2020 at 10:53:26PM +0000, Trond Myklebust wrote:
> > > > On Thu, 2020-12-03 at 17:45 -0500, bfields@xxxxxxxxxxxx wrote:
> > > > > On Thu, Dec 03, 2020 at 09:34:26PM +0000, Trond Myklebust wrote:
> > > > > > I've been wanting such a function for quite a while anyway in
> > > > > > order to allow the client to detect state leaks (either due to
> > > > > > soft timeouts, or due to reordered close/open operations).
> > > > >
> > > > > One sure way to fix any state leaks is to reboot the server. The
> > > > > server throws everything away, the clients reclaim, all that's
> > > > > left is stuff they still actually care about.
> > > > >
> > > > > It's very disruptive.
> > > > >
> > > > > But you could do a limited version of that: the server throws
> > > > > away the state from one client (keeping the underlying locks on
> > > > > the exported filesystem), lets the client go through its normal
> > > > > reclaim process, at the end of that throws away anything that
> > > > > wasn't reclaimed. The only delay is to anyone trying to acquire
> > > > > new locks that conflict with that set of locks, and only for as
> > > > > long as it takes for the one client to reclaim.
> > > >
> > > > One could do that, but that requires the existence of a quiescent
> > > > period where the client holds no state at all on the server.
> > >
> > > No, as I said, the client performs reboot recovery for any state
> > > that it holds when we do this.
> > >
> >
> > Hmm... So how do the client and server coordinate what can and cannot
> > be reclaimed?
> > The issue is that races can work both ways, with the client
> > sometimes believing that it holds a layout or a delegation that the
> > server thinks it has returned. If the server allows a reclaim of such
> > a delegation, then that could be problematic (because it breaks lock
> > atomicity on the client and because it may cause conflicts).
>
> The server's not actually forgetting anything, it's just pretending to,
> in order to trigger the client's reboot recovery. It can turn down the
> client's attempt to reclaim something it doesn't have.
>
> Though isn't it already game over by the time the client thinks it
> holds some lock/open/delegation that the server doesn't? I guess I'd
> need to see these cases written out in detail to understand.
>

Normally, the server will return NFS4ERR_BAD_STATEID or
NFS4ERR_OLD_STATEID if the client tries to use an invalid stateid. The
issue here is that you'd be discarding that machinery, because the
client is forgetting its stateids when it gets told that the server
rebooted. That again puts the onus on the server to verify more
strongly whether or not the client is recovering state that it actually
holds.

So to elaborate a little more on the cases where we have seen the
client and server state mess up here. Typically it happens when we
build COMPOUNDS where there is a stateful operation followed by a slow
operation. Something like

Thread 1
========
OPEN(foo) + LAYOUTGET
-> openstateid(01: blah)

Thread 2
========
OPEN(foo)
-> openstateid(02: blah)
CLOSE(openstateid(02: blah))

(gets reply from OPEN).

Typically the client forgets about the stateid after the CLOSE, so when
it gets a reply to the original OPEN, it thinks it just got a
completely fresh stateid "openstateid(01: blah)", which it might try to
reclaim if the server declares a reboot.

> --b.
>
> > By the way, the other thing that I'd like to add to my wishlist is a
> > callback that allows the server to ask the client if it still holds a
> > given open or lock stateid. A server can recall a delegation or a
> > layout, so it can fix up leaks of those, however it has no remedy if
> > the client loses an open or lock stateid other than to possibly
> > forcibly revoke state. That could cause application crashes if the
> > server makes a mistake and revokes a lock that is actually in use.
> >

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
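
The OPEN/CLOSE reordering described in this message can be replayed with a toy model. This is a rough sketch for illustration only, not NFS client or server code: the `Server` and `Client` classes, `replay_race`, and `reclaim_allowed` are hypothetical names, and a stateid is reduced to a `(name, seqid)` pair in which the file name stands in for the stateid's opaque part ("blah" in the trace). Only the two error names come from the message itself.

```python
class Server:
    """Toy server: one open stateid per file, bumped on each OPEN."""

    def __init__(self):
        self.opens = {}  # file name -> current seqid

    def open(self, name):
        seqid = self.opens.get(name, 0) + 1
        self.opens[name] = seqid
        return (name, seqid)

    def close(self, stateid):
        self.opens.pop(stateid[0], None)

    def check(self, stateid):
        """Normal-operation check when a client presents a stateid."""
        name, seqid = stateid
        if name not in self.opens:
            return "NFS4ERR_BAD_STATEID"
        if seqid < self.opens[name]:
            return "NFS4ERR_OLD_STATEID"
        return "NFS4_OK"

    def reclaim_allowed(self, name):
        """Assumed policy: only allow reclaim of opens the server still records."""
        return name in self.opens


class Client:
    """Toy client: remembers the newest seqid per file, forgets on CLOSE."""

    def __init__(self):
        self.opens = {}

    def on_open_reply(self, stateid):
        name, seqid = stateid
        known = self.opens.get(name)
        if known is None or seqid > known:
            self.opens[name] = seqid

    def on_close_reply(self, stateid):
        self.opens.pop(stateid[0], None)


def replay_race():
    server, client = Server(), Client()
    late = server.open("foo")    # thread 1: OPEN + LAYOUTGET, reply delayed
    s2 = server.open("foo")      # thread 2: OPEN, reply arrives promptly
    client.on_open_reply(s2)
    server.close(s2)             # thread 2: CLOSE
    client.on_close_reply(s2)    # client forgets the stateid
    client.on_open_reply(late)   # thread 1's reply finally lands
    return client, server, late


if __name__ == "__main__":
    client, server, late = replay_race()
    print("client thinks it holds:", client.opens)
    print("server records:", server.opens)
    print("using the stale stateid:", server.check(late))
    print("reclaim of foo allowed:", server.reclaim_allowed("foo"))
```

In this model the client ends up holding an open for `foo` that the server has no record of. During normal operation the stale stateid is caught by `check`, but a reclaim names the file rather than presenting the old stateid, so the server has to fall back on its own records, which is the stronger verification the message asks for.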