Re: Question about nlmclnt_lock

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Sat, 6 Aug 2022 02:26:55 +0000

On Fri, 2022-08-05 at 19:17 -0400, Jan Kasiak wrote:
> Hi,
> 
> I was looking at the code for nlmclnt_lock and wanted to ask a
> question about how the Linux kernel client and the NLM 4 protocol
> handle some errors around certain edge cases.
> 
> Specifically, I think there is a race condition around two threads of
> the same program acquiring a lock, one of the threads being
> interrupted, and the NFS client sending an unlock when none of the
> program threads called unlock.
> 
> On NFS server machine S:
> there exists an unlocked file F
> 
> On NFS client machine C:
> in program P:
> thread 1 tries to lock(F) with fd A
> thread 2 tries to lock(F) with fd B
> 
> The Linux client will issue two NLM_LOCK calls with the same svid and
> same range, because it uses the program id to map to an svid.
> 
> For whatever reason, assume the connection is broken (the cable gets
> pulled, etc.) and
> `status = nlmclnt_call(cred, req, NLMPROC_LOCK);` fails.
> 
> The Linux client will retry the request, but at some point thread 1
> receives a signal and nlmclnt_lock breaks out of its loop. Because
> the Linux client request failed, it will fall through and go to the
> out_unlock label, where it will want to send an unlock request.
> 
> Assume that at some point the connection is reestablished.
> 
> The Linux kernel client now has two outstanding lock requests to send
> to the remote server: one for a lock that thread 2 is still trying to
> acquire, and one for an unlock of thread 1 that failed and was
> interrupted.
> 
> I'm worried that the Linux client may first send the lock request
> and tell thread 2 that it acquired the lock, and then send an unlock
> request from the cancelled thread 1 request.
> 
> The server will successfully process both requests, because the svid
> is the same for both, and the true server side state will be that the
> file is unlocked.
> 
> One can talk about the wisdom of using multiple threads to acquire
> the same file lock, but this behavior is weird, because none of the
> threads called unlock.
> 
> I have experimented with reproducing this, but have not been
> successful in triggering this ordering of events.
> 
> I've also looked at the code in clntproc.c and I don't see a spot
> where outstanding failed lock/unlock requests are checked while
> processing lock requests.
> 
> Thanks,
> -Jan
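
For illustration only, the ordering described in the quoted message can be
replayed against a toy model of a server lock table. This is a sketch under
loud assumptions, not kernel or lockd code: every name below (toy_server,
toy_lock, toy_unlock, toy_replay_race) is hypothetical, the model tracks a
single file and a single byte range, it identifies the lock owner by svid
alone, and it assumes an unlock of a range that is not held reports success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: one file, one byte range, owner identified by svid. */
struct toy_server {
    bool     locked;
    uint32_t owner_svid;
};

enum toy_status { TOY_GRANTED, TOY_DENIED };

/* Grant the lock if the range is free or already held by the same svid. */
static enum toy_status toy_lock(struct toy_server *s, uint32_t svid)
{
    assert(s);
    if (s->locked && s->owner_svid != svid)
        return TOY_DENIED;
    s->locked = true;
    s->owner_svid = svid;
    return TOY_GRANTED;
}

/* Drop the lock if this svid holds it. Assumption: unlocking a range
 * that is not held is still reported as success. */
static enum toy_status toy_unlock(struct toy_server *s, uint32_t svid)
{
    assert(s);
    if (s->locked && s->owner_svid == svid)
        s->locked = false;
    return TOY_GRANTED;
}

/* Replay the feared ordering; returns whether the file ends up locked. */
static bool toy_replay_race(void)
{
    struct toy_server s = { .locked = false, .owner_svid = 0 };
    const uint32_t svid = 42;  /* both threads of P share one svid */

    toy_lock(&s, svid);    /* thread 2's retried LOCK is granted */
    toy_unlock(&s, svid);  /* thread 1's cleanup UNLOCK arrives later */
    return s.locked;
}
```

In this model toy_replay_race() returns false: the server cannot tell the
two threads apart, so the late unlock releases the lock that thread 2
believes it holds. Whether the real client can emit the requests in that
order is exactly the open question in the quoted message; the sketch does
not answer it.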

Nobody here is likely to want to waste much time trying to 'fix' the
NLM locking protocol. The protocol itself is known to be extremely
fragile, and the endemic problems constitute some of the main
motivations for the development of the NFSv4 protocol
(See https://datatracker.ietf.org/doc/html/rfc2624#section-8
and https://datatracker.ietf.org/doc/html/rfc7530#section-9).

If you need more reliable support for POSIX locks beyond what exists
today for NLM, then please consider NFSv4.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx