On 8/17/23 4:08 PM, Jeff Layton wrote:
On Thu, 2023-08-17 at 15:59 -0700, dai.ngo@xxxxxxxxxx wrote:
On 8/17/23 3:23 PM, dai.ngo@xxxxxxxxxx wrote:
On 8/17/23 2:07 PM, Jeff Layton wrote:
On Thu, 2023-08-17 at 13:15 -0400, Jeff Layton wrote:
On Thu, 2023-08-17 at 16:31 +0000, Chuck Lever III wrote:
On Aug 17, 2023, at 12:27 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
On Thu, 2023-08-17 at 11:17 -0400, Anna Schumaker wrote:
On Thu, Aug 17, 2023 at 10:22 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
On Thu, 2023-08-17 at 14:04 +0000, Chuck Lever III wrote:
On Aug 17, 2023, at 7:21 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
I finally got my kdevops (https://github.com/linux-kdevops/kdevops) test rig working well enough to get some publishable results. To run fstests, kdevops will spin up a server and (in this case) 2 clients to run xfstests' auto group. One client mounts with default options, and the other uses NFSv3.
I tested 3 kernels:
v6.4.0 (stock release)
6.5.0-rc6-g4853c74bd7ab (Linus' tree as of a couple of days ago)
6.5.0-rc6-next-20230816-gef66bf8aeb91 (linux-next as of yesterday morning)
Here is the results summary for all 3:
KERNEL: 6.4.0
CPUS: 8
nfs_v3: 727 tests, 12 failures, 569 skipped, 14863 seconds
Failures: generic/053 generic/099 generic/105 generic/124
generic/193 generic/258 generic/294 generic/318 generic/319
generic/444 generic/528 generic/529
nfs_default: 727 tests, 18 failures, 452 skipped, 21899 seconds
Failures: generic/053 generic/099 generic/105 generic/186
generic/187 generic/193 generic/294 generic/318 generic/319
generic/357 generic/444 generic/486 generic/513 generic/528
generic/529 generic/578 generic/675 generic/688
Totals: 1454 tests, 1021 skipped, 30 failures, 0 errors, 35096s
KERNEL: 6.5.0-rc6-g4853c74bd7ab
CPUS: 8
nfs_v3: 727 tests, 9 failures, 570 skipped, 14775 seconds
Failures: generic/053 generic/099 generic/105 generic/258
generic/294 generic/318 generic/319 generic/444 generic/529
nfs_default: 727 tests, 16 failures, 453 skipped, 22326 seconds
Failures: generic/053 generic/099 generic/105 generic/186
generic/187 generic/294 generic/318 generic/319 generic/357
generic/444 generic/486 generic/513 generic/529 generic/578
generic/675 generic/688
Totals: 1454 tests, 1023 skipped, 25 failures, 0 errors, 35396s
KERNEL: 6.5.0-rc6-next-20230816-gef66bf8aeb91
CPUS: 8
nfs_v3: 727 tests, 9 failures, 570 skipped, 14657 seconds
Failures: generic/053 generic/099 generic/105 generic/258
generic/294 generic/318 generic/319 generic/444 generic/529
nfs_default: 727 tests, 18 failures, 453 skipped, 21757 seconds
Failures: generic/053 generic/099 generic/105 generic/186
generic/187 generic/294 generic/318 generic/319 generic/357
generic/444 generic/486 generic/513 generic/529 generic/578
generic/675 generic/683 generic/684 generic/688
Totals: 1454 tests, 1023 skipped, 27 failures, 0 errors, 34870s
As long as we're sharing results ... here is what I'm seeing with a
6.5-rc6 client & server:
anna@gouda ~ % xfstestsdb xunit list --results --runid 1741 --color=none
+------+----------------------+---------+----------+------+------+------+-------+
| run  | device               | xunit   | hostname | pass | fail | skip | time  |
+------+----------------------+---------+----------+------+------+------+-------+
| 1741 | server:/srv/xfs/test | tcp-3   | client   |  125 |    4 |  464 | 447 s |
| 1741 | server:/srv/xfs/test | tcp-4.0 | client   |  117 |   11 |  465 | 478 s |
| 1741 | server:/srv/xfs/test | tcp-4.1 | client   |  119 |   12 |  462 | 404 s |
| 1741 | server:/srv/xfs/test | tcp-4.2 | client   |  212 |   18 |  363 | 564 s |
+------+----------------------+---------+----------+------+------+------+-------+
anna@gouda ~ % xfstestsdb show --failure 1741 --color=none
+-------------+---------+---------+---------+---------+
| testcase    | tcp-3   | tcp-4.0 | tcp-4.1 | tcp-4.2 |
+-------------+---------+---------+---------+---------+
| generic/053 | passed  | failure | failure | failure |
| generic/099 | passed  | failure | failure | failure |
| generic/105 | passed  | failure | failure | failure |
| generic/140 | skipped | skipped | skipped | failure |
| generic/188 | skipped | skipped | skipped | failure |
| generic/258 | failure | passed  | passed  | failure |
| generic/294 | failure | failure | failure | failure |
| generic/318 | passed  | failure | failure | failure |
| generic/319 | passed  | failure | failure | failure |
| generic/357 | skipped | skipped | skipped | failure |
| generic/444 | failure | failure | failure | failure |
| generic/465 | passed  | failure | failure | failure |
| generic/513 | skipped | skipped | skipped | failure |
| generic/529 | passed  | failure | failure | failure |
| generic/604 | passed  | passed  | failure | passed  |
| generic/675 | skipped | skipped | skipped | failure |
| generic/688 | skipped | skipped | skipped | failure |
| generic/697 | passed  | failure | failure | failure |
| nfs/002     | failure | failure | failure | failure |
+-------------+---------+---------+---------+---------+
With NFSv4.2, v6.4.0 has 2 extra failures that the current mainline kernel doesn't:

generic/193 (some sort of setattr problem)
generic/528 (known problem with btime handling in the client that has been fixed)

While I haven't investigated, I'm assuming the 193 bug is also something that has been fixed in recent kernels. There are also 3 other NFSv3 tests that started passing since v6.4.0. I haven't looked into those.

With the linux-next kernel there are 2 new regressions:

generic/683
generic/684

Both of these look like problems with setuid/setgid stripping, and still need to be investigated. I have more verbose result info on the test failures if anyone is interested.
Interesting that I'm not seeing the 683 & 684 failures. What type of
filesystem is your server exporting?
btrfs
You are testing linux-next? I need to go back and confirm these results too.
IMO linux-next is quite important: we keep hitting bugs that appear only after integration -- block and network changes in other trees especially can impact the NFS drivers.
Indeed, I suspect this is probably something from the vfs tree (though
we definitely need to confirm that). Today I'm testing:
6.5.0-rc6-next-20230817-g47762f086974
Nope, I was wrong. I ran a bisect and it landed here. I confirmed it by
turning off leases on the nfs server and the test started passing. I
probably won't have the cycles to chase this down further.
The capture looks something like this:
OPEN (get a write delegation)
WRITE
CLOSE
SETATTR (mode 06666)
...then presumably a task on the client opens the file again, but the
setuid bits don't get stripped.
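
For illustration, here is a minimal user-space sketch of that sequence as it might look from the client side (the path is made up and this is not the actual fstests code; it just shows the generic expectation that an unprivileged write clears the setuid/setgid bits):

/*
 * Sketch only: run as an unprivileged user against an NFS mount
 * (hypothetical path).  After the second write, the setuid/setgid
 * bits set by the chmod should have been stripped; with the
 * regression they survive.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/nfs/suidtest";	/* hypothetical mount */
	struct stat st;
	int fd;

	fd = open(path, O_CREAT | O_WRONLY, 0644);	/* OPEN (write delegation) */
	write(fd, "x", 1);				/* WRITE */
	close(fd);					/* CLOSE */

	chmod(path, 06666);				/* SETATTR (mode 06666) */

	fd = open(path, O_WRONLY);			/* reopen the file... */
	write(fd, "y", 1);				/* ...and write again */
	fstat(fd, &st);
	close(fd);

	/* An unprivileged write should clear the setuid/setgid bits. */
	printf("mode after write: %o (expect no 06000 bits)\n",
	       st.st_mode & 07777);
	return 0;
}
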
I think either the client will need to strip these bits on a delegated open, or we'll need to recall write delegations from the client when it tries to do a SETATTR with a mode that could later end up needing to be stripped on a subsequent open.
66ce3e3b98a7a9e970ea463a7f7dc0575c0a244b is the first bad commit
commit 66ce3e3b98a7a9e970ea463a7f7dc0575c0a244b
Author: Dai Ngo <dai.ngo@xxxxxxxxxx>
Date: Thu Jun 29 18:52:40 2023 -0700
NFSD: Enable write delegation support
The SETATTR should cause the delegation to be recalled. However, I think there is an optimization on the server that skips the recall if the SETATTR comes from the same client that holds the delegation.
The optimization on the server was done by this commit:
28df3d1539de nfsd: clients don't need to break their own delegations
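
For context, that optimization works roughly like this (a paraphrased sketch, not the exact upstream code): before breaking a lease/delegation for a conflicting access, the lease layer asks the lock manager whether the breaker is the owner itself, and nfsd's callback answers yes when the conflicting operation comes from the delegation holder.

/*
 * Paraphrased sketch of the 28df3d1539de optimization; not the exact
 * upstream code.  nfsd's lm_breaker_owns_lease callback returns true
 * when the conflicting operation arrives in a compound from the same
 * NFSv4 client that holds the delegation, so no recall is sent.
 */
static bool leases_conflict(struct file_lock *lease, struct file_lock *breaker)
{
	if (lease->fl_lmops->lm_breaker_owns_lease &&
	    lease->fl_lmops->lm_breaker_owns_lease(lease))
		return false;	/* holder doesn't conflict with itself */

	/* ... the normal read/write lease conflict checks follow ... */
	return true;
}
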
Perhaps we should allow this optimization for read delegations only? Or should the NFS client be responsible for handling the SETATTR and the local OPEN on a file that has a write delegation granted?
I think that setuid/setgid files are really a special case.
We already avoid giving out delegations on setuid/gid files. What we're
not doing currently is revoking the write delegation if the holder tries
to set a mode that involves a setuid/gid bit. If we add that, then that
should close the hole, I think.
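
A rough sketch of that check (the helper name and placement are hypothetical, not the eventual patch):

/*
 * Hypothetical sketch only; names and placement are illustrative.
 * The idea: a SETATTR that sets setuid/setgid bits should not get the
 * "breaker owns lease" shortcut, so any outstanding write delegation
 * is recalled before the mode change takes effect, even when the
 * SETATTR comes from the delegation holder itself.
 */
static bool setattr_mode_needs_recall(const struct iattr *iap)
{
	return (iap->ia_valid & ATTR_MODE) &&
	       (iap->ia_mode & (S_ISUID | S_ISGID));
}

/* In the server's SETATTR path (sketch):
 *
 *	if (setattr_mode_needs_recall(iap))
 *		recall any write delegation on the file, even one held
 *		by the client issuing the SETATTR;
 */
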
This approach seems reasonable; I'll work on a patch to take care of this condition.
Thanks Jeff,
-Dai