Re: [PATCH 0/7] lockd: fix races that can result in stuck filelocks

On Mon, Mar 13, 2023 at 12:45 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Sun, 2023-03-12 at 17:33 +0200, Amir Goldstein wrote:
> > On Fri, Mar 3, 2023 at 4:54 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> > >
> > >
> > >
> > > > On Mar 3, 2023, at 7:15 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > I sent the first patch in this series the other day, but didn't get any
> > > > responses.
> > >
> > > We'll have to work out who will take which patches in this set.
> > > Once fully reviewed, I can take the set if the client maintainers
> > > send Acks for 2-4 and 6-7.
> > >
> > > nfsd-next for v6.4 is not yet open. I can work on setting that up
> > > today.
> > >
> > >
> > > > Since then I've had time to follow up on the client-side part
> > > > of this problem, which eventually also pointed out yet another bug on
> > > > the server side. There are also a couple of cleanup patches in here too,
> > > > and a patch to add some tracepoints that I found useful while diagnosing
> > > > this.
> > > >
> > > > With this set on both client and server, I'm now able to run Yongcheng's
> > > > test for an hour straight with no stuck locks.
> >
> > My nfstest_lock test occasionally gets into an endless wait loop for the lock in
> > one of the optests.

I forgot to mention that the regression is only with nfsversion=3!
Is anyone else running nfstest_lock with nfsversion=3?

> >
> > AFAIK, this started happening after I upgraded my client machine to v5.15.88.
> > Does this seem related to the client bug fixes in this patch set?
> >
> > If so, is this bug a regression? and why are the fixes aimed for v6.4?
> >
>
> Most of this (lockd) code hasn't changed in well over a decade, so if
> this is a regression then it's a very old one. I suppose it's possible
> that this regressed after the BKL was removed from this code, but that
> was a long time ago now and I'm not sure I can identify a commit that
> this fixes.
>
> I'm fine with this going in sooner than v6.4, but given that this has
> been broken so long, I didn't see the need to rush.
>

I don't know how the optest regression that I am experiencing relates
to the client and server bugs mentioned in this patch set.
I just re-tested optest01 with several combinations of client and server
kernels, rebooting both client and server before each test.
The results are a bit odd:

client        server     optest01 result
------------------------------------------------------
5.10.109      5.10.109   optest01 completes successfully after <30s
5.15.88       5.15.88    optest01 never completes (see attached log)
5.15.88       5.10.109   optest01 never completes
5.15.88+ [*]  5.15.88    optest01 never completes
5.15.88+      5.10.109   optest01 never completes
5.15.88+      5.15.88+   optest01 completes successfully after ~300s [**]

Unless I missed something with the tests, it looks like:
1.a. There was a regression in the client between 5.10.109 and 5.15.88
1.b. The regression manifests with both 5.10 and 5.15 servers
2.a. The patches improve the situation (from an endless wait to 30s per wait)...
2.b. ...but only when applied to both client and server and...
2.c. The situation is still a lot worse than 5.10 client with 5.10 server

Also attached is the NFS[D] Kconfig, which is identical for the tested
5.10 and 5.15 kernels.

Do you need me to provide any traces or any other info?

Thanks,
Amir.

[*] 5.15.88+ stands for 5.15.88 + the patches in this set, which all
apply cleanly
[**] The test takes 300s because every single 30s wait takes the entire 30s:

    DBG1: 15:21:47.118095 - Unlock file (F_UNLCK, F_SETLK) off=0 len=0 range(0, 18446744073709551615)
    DBG3: 15:21:47.119832 - Wait up to 30 secs to check if blocked lock has been granted @253.87
    DBG3: 15:21:48.121296 - Check if blocked lock has been granted @254.87
...
    DBG3: 15:22:14.158314 - Check if blocked lock has been granted @280.90
    DBG3: 15:22:15.017594 - Getting results from blocked lock @281.76
    DBG1: 15:22:15.017832 - Unlock file (F_UNLCK, F_SETLK) off=0 len=0 range(0, 18446744073709551615) on second process @281.76
    PASS: Locking byte range (72 passed, 0 failed)
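
For anyone who wants to look at the same wait pattern without running
nfstest, the sequence in the log (one process unlocks the whole file, a
second process blocked on the lock should then be granted it) can be
approximated with a small script. This is only a rough sketch and not what
nfstest_lock does internally: the function name measure_grant_delay and the
hold_secs parameter are made up for illustration, and it assumes Linux
(os.fork, and a monotonic clock that is comparable between processes).

```python
import fcntl
import os
import time


def measure_grant_delay(path, hold_secs=1.0):
    """Return the seconds between the parent's unlock and the child's grant.

    Hypothetical helper for illustration only. The parent takes an
    exclusive POSIX lock on the whole file (offset 0, length 0), the child
    blocks waiting for the same lock, and the parent then unlocks.
    """
    read_fd, write_fd = os.pipe()
    with open(path, "w") as holder:
        # Non-blocking-style acquisition is not needed here: nobody else
        # holds the lock yet, so this returns immediately.
        fcntl.lockf(holder, fcntl.LOCK_EX)

        pid = os.fork()
        if pid == 0:
            # Child: POSIX locks are not inherited across fork, so this
            # call blocks (F_SETLKW) until the parent unlocks.
            status = 1
            try:
                os.close(read_fd)
                with open(path, "w") as waiter:
                    fcntl.lockf(waiter, fcntl.LOCK_EX)
                    os.write(write_fd, repr(time.monotonic()).encode())
                    fcntl.lockf(waiter, fcntl.LOCK_UN)
                status = 0
            finally:
                os._exit(status)

        # Parent: give the child time to block, then release the lock.
        os.close(write_fd)
        time.sleep(hold_secs)
        unlocked_at = time.monotonic()
        fcntl.lockf(holder, fcntl.LOCK_UN)

        # The child reports the time at which its lock was granted.
        granted_at = float(os.read(read_fd, 64).decode())
        os.close(read_fd)
        os.waitpid(pid, 0)
    return granted_at - unlocked_at
```

Calling measure_grant_delay() on a file under an nfsversion=3 mount would
show whether the blocked lock is granted shortly after the unlock or only
after a long delay. On a local filesystem the returned delay should be
close to zero.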

Attachment: optest01.nfsver3.linux-5.15.88.log.gz
Description: application/gzip

Attachment: 5.15.88.NFS.config
Description: Binary data

