Re: Question about nlmclnt_lock

Jan Kasiak <j.kasiak@xxxxxxxxx> · Sat, 6 Aug 2022 11:03:34 -0400

Hi Trond,

The v4 RFCs do mention protocol design flaws, but don't go into more detail.

I was trying to understand those flaws in order to understand how and
why v3 was problematic.
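
The ordering described in the quoted message below can be modeled with a
toy server. This is only a sketch under stated assumptions, not lockd or
nlmclnt code: it assumes a single file, a single byte range, a single
client host, and a server that identifies the lock owner by svid alone.
The names toy_server, toy_lock, toy_unlock and toy_replay_race are
hypothetical, and the svid value is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one file, one byte range, one client host (assumption). */
struct toy_server {
    bool     locked;      /* is the range currently locked? */
    uint32_t owner_svid;  /* svid of the holder, meaningful when locked */
};

enum toy_status { TOY_GRANTED, TOY_DENIED };

/* LOCK: granted if the range is free or already held by the same svid. */
static enum toy_status toy_lock(struct toy_server *s, uint32_t svid)
{
    if (s->locked && s->owner_svid != svid)
        return TOY_DENIED;
    s->locked = true;
    s->owner_svid = svid;
    return TOY_GRANTED;
}

/* UNLOCK: drops the lock if this svid holds it; always reports success. */
static enum toy_status toy_unlock(struct toy_server *s, uint32_t svid)
{
    if (s->locked && s->owner_svid == svid)
        s->locked = false;
    return TOY_GRANTED;
}

/*
 * Replays the suspected order after the connection comes back:
 *   1. thread 2's LOCK is sent and granted,
 *   2. the UNLOCK left over from interrupted thread 1 is sent.
 * Returns the server-side lock state at the end.
 */
static bool toy_replay_race(void)
{
    struct toy_server srv = { .locked = false, .owner_svid = 0 };
    const uint32_t svid = 42;  /* hypothetical: both threads share it */
    enum toy_status st;

    st = toy_lock(&srv, svid);    /* thread 2 is told it holds the lock */
    assert(st == TOY_GRANTED);
    (void)st;

    st = toy_unlock(&srv, svid);  /* stale unlock from thread 1 */
    assert(st == TOY_GRANTED);
    (void)st;

    return srv.locked;
}
```

In this model toy_replay_race() returns false: thread 2 has been told it
holds the lock, while the server considers the file unlocked. Whether the
real client can emit the two requests in that order is the open question.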

-Jan

On Fri, Aug 5, 2022 at 10:27 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, 2022-08-05 at 19:17 -0400, Jan Kasiak wrote:
> > Hi,
> >
> > I was looking at the code for nlmclnt_lock and wanted to ask a
> > question about how the Linux kernel client and the NLM 4 protocol
> > handle some errors around certain edge cases.
> >
> > Specifically, I think there is a race condition around two threads of
> > the same program acquiring a lock, one of the threads being
> > interrupted, and the NFS client sending an unlock when none of the
> > program threads called unlock.
> >
> > On NFS server machine S:
> > there exists an unlocked file F
> >
> > On NFS client machine C:
> > in program P:
> > thread 1 tries to lock(F) with fd A
> > thread 2 tries to lock(F) with fd B
> >
> > The Linux client will issue two NLM_LOCK calls with the same svid and
> > same range, because it uses the program id to map to an svid.
> >
> > For whatever reason, assume the connection is broken (cable gets
> > pulled etc...)
> > and `status = nlmclnt_call(cred, req, NLMPROC_LOCK);` fails.
> >
> > The Linux client will retry the request, but at some point thread 1
> > receives a signal and nlmclnt_lock breaks out of its loop. Because the
> > Linux client request failed, it will fall through and go to the
> > out_unlock label, where it will want to send an unlock request.
> >
> > Assume that at some point the connection is reestablished.
> >
> > The Linux kernel client now has two outstanding requests to send
> > to the remote server: a lock request that thread 2 is still trying to
> > complete, and an unlock request for thread 1, whose lock attempt
> > failed and was interrupted.
> >
> > I'm worried that the Linux client may first send the lock request
> > and tell thread 2 that it acquired the lock, and then send an unlock
> > request from the cancelled thread 1 request.
> >
> > The server will successfully process both requests, because the svid
> > is the same for both, and the true server side state will be that the
> > file is unlocked.
> >
> > One can talk about the wisdom of using multiple threads to acquire the
> > same file lock, but this behavior is weird, because none of the
> > threads called unlock.
> >
> > I have experimented with reproducing this, but have not been
> > successful in triggering this ordering of events.
> >
> > I've also looked at the code in clntproc.c and I don't see a spot
> > where outstanding failed lock/unlock requests are checked while
> > processing lock requests.
> >
> > Thanks,
> > -Jan
>
> Nobody here is likely to want to waste much time trying to 'fix' the
> NLM locking protocol. The protocol itself is known to be extremely
> fragile, and the endemic problems constitute some of the main
> motivations for the development of the NFSv4 protocol
> (See https://datatracker.ietf.org/doc/html/rfc2624#section-8
> and https://datatracker.ietf.org/doc/html/rfc7530#section-9).
>
> If you need more reliable support for POSIX locks beyond what exists
> today for NLM, then please consider NFSv4.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>