Hi Trond,

The v4 RFCs do mention protocol design flaws, but don't go into more
detail.

I was trying to understand those flaws in order to understand how and
why v3 was problematic.

-Jan

On Fri, Aug 5, 2022 at 10:27 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2022-08-05 at 19:17 -0400, Jan Kasiak wrote:
> > Hi,
> >
> > I was looking at the code for nlmclnt_lock and wanted to ask a
> > question about how the Linux kernel client and the NLM 4 protocol
> > handle some errors around certain edge cases.
> >
> > Specifically, I think there is a race condition around two threads of
> > the same program acquiring a lock, one of the threads being
> > interrupted, and the NFS client sending an unlock when none of the
> > program threads called unlock.
> >
> > On NFS server machine S:
> > there exists an unlocked file F
> >
> > On NFS client machine C:
> > in program P:
> > thread 1 tries to lock(F) with fd A
> > thread 2 tries to lock(F) with fd B
> >
> > The Linux client will issue two NLM_LOCK calls with the same svid and
> > same range, because it uses the program id to map to an svid.
> >
> > For whatever reason, assume the connection is broken (cable gets
> > pulled etc...)
> > and `status = nlmclnt_call(cred, req, NLMPROC_LOCK);` fails.
> >
> > The Linux client will retry the request, but at some point thread 1
> > receives a signal and nlmclnt_lock breaks out of its loop. Because the
> > Linux client request failed, it will fall through and go to the
> > out_unlock label, where it will want to send an unlock request.
> >
> > Assume that at some point the connection is reestablished.
> >
> > The Linux kernel client now has two outstanding lock requests to send
> > to the remote server: one for a lock that thread 2 is still trying to
> > acquire, and one for an unlock of thread 1 that failed and was
> > interrupted.
> >
> > I'm worried that the Linux client may first send the lock request, and
> > tell thread 2 that it acquired the lock, and then send an unlock
> > request from the cancelled thread 1 request.
> >
> > The server will successfully process both requests, because the svid
> > is the same for both, and the true server side state will be that the
> > file is unlocked.
> >
> > One can talk about the wisdom of using multiple threads to acquire the
> > same file lock, but this behavior is weird, because none of the
> > threads called unlock.
> >
> > I have experimented with reproducing this, but have not been
> > successful in triggering this ordering of events.
> >
> > I've also looked at the code in clntproc.c and I don't see a spot
> > where outstanding failed lock/unlock requests are checked while
> > processing lock requests?
> >
> > Thanks,
> > -Jan
>
> Nobody here is likely to want to waste much time trying to 'fix' the
> NLM locking protocol. The protocol itself is known to be extremely
> fragile, and the endemic problems constitute some of the main
> motivations for the development of the NFSv4 protocol
> (See https://datatracker.ietf.org/doc/html/rfc2624#section-8
> and https://datatracker.ietf.org/doc/html/rfc7530#section-9).
>
> If you need more reliable support for POSIX locks beyond what exists
> today for NLM, then please consider NFSv4.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
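
For reference, the ordering I am worried about in the quoted message can be
written down as a toy model. This is only a sketch under assumptions: it is
not lockd or the kernel client, the class ToyNlmServer and the function
replay_after_reconnect are hypothetical names, the svid value is made up, and
the server is simplified to keep one range per (file, svid) owner. It also
assumes the two requests really do reach the server in that order, which I
have not been able to reproduce.

```python
class ToyNlmServer:
    """Toy lock table keyed by (file, svid). Not the real lockd logic."""

    def __init__(self):
        # (filename, svid) -> (offset, length); one range per owner (simplification)
        self.locks = {}

    def lock(self, filename, svid, offset, length):
        # A request conflicts only with overlapping locks held by a different svid.
        for (name, owner), (off, ln) in self.locks.items():
            overlaps = off < offset + length and offset < off + ln
            if name == filename and owner != svid and overlaps:
                return "DENIED"
        self.locks[(filename, svid)] = (offset, length)
        return "GRANTED"

    def unlock(self, filename, svid, offset, length):
        # The server cannot tell which client thread the unlock came from:
        # it only sees the svid, so it drops that owner's lock.
        self.locks.pop((filename, svid), None)
        return "GRANTED"

    def is_locked(self, filename):
        return any(name == filename for name, _ in self.locks)


def replay_after_reconnect():
    """Replay the two queued requests in the problematic order."""
    server = ToyNlmServer()
    svid = 4242  # hypothetical: both threads of program P map to the same svid
    client_view = {}
    # Thread 2's retried NLM_LOCK arrives first and is granted.
    client_view["thread2_lock"] = server.lock("F", svid, 0, 100)
    # Thread 1's cleanup unlock (sent after the signal) arrives second.
    client_view["thread1_unlock"] = server.unlock("F", svid, 0, 100)
    return client_view, server.is_locked("F")
```

In this model thread 2 is told the lock was granted, yet is_locked("F") is
false afterwards, because the server keys ownership on the svid alone and
cannot distinguish the two threads.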