On Tue, 2023-08-22 at 09:07 -0700, dai.ngo@xxxxxxxxxx wrote:
> On 8/17/23 4:08 PM, Jeff Layton wrote:
> > On Thu, 2023-08-17 at 15:59 -0700, dai.ngo@xxxxxxxxxx wrote:
> > > On 8/17/23 3:23 PM, dai.ngo@xxxxxxxxxx wrote:
> > > > On 8/17/23 2:07 PM, Jeff Layton wrote:
> > > > > On Thu, 2023-08-17 at 13:15 -0400, Jeff Layton wrote:
> > > > > > On Thu, 2023-08-17 at 16:31 +0000, Chuck Lever III wrote:
> > > > > > > > On Aug 17, 2023, at 12:27 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Thu, 2023-08-17 at 11:17 -0400, Anna Schumaker wrote:
> > > > > > > > > On Thu, Aug 17, 2023 at 10:22 AM Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > > > > > wrote:
> > > > > > > > > > On Thu, 2023-08-17 at 14:04 +0000, Chuck Lever III wrote:
> > > > > > > > > > > > On Aug 17, 2023, at 7:21 AM, Jeff Layton <jlayton@xxxxxxxxxx>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > I finally got my kdevops
> > > > > > > > > > > > (https://github.com/linux-kdevops/kdevops) test
> > > > > > > > > > > > rig working well enough to get some publishable results. To
> > > > > > > > > > > > run fstests,
> > > > > > > > > > > > kdevops will spin up a server and (in this case) 2 clients to run
> > > > > > > > > > > > xfstests' auto group. One client mounts with default options,
> > > > > > > > > > > > and the
> > > > > > > > > > > > other uses NFSv3.
> > > > > > > > > > > >
> > > > > > > > > > > > I tested 3 kernels:
> > > > > > > > > > > >
> > > > > > > > > > > > v6.4.0 (stock release)
> > > > > > > > > > > > 6.5.0-rc6-g4853c74bd7ab (Linus' tree as of a couple of days ago)
> > > > > > > > > > > > 6.5.0-rc6-next-20230816-gef66bf8aeb91 (linux-next as of
> > > > > > > > > > > > yesterday morning)
> > > > > > > > > > > >
> > > > > > > > > > > > Here are the results summary of all 3:
> > > > > > > > > > > >
> > > > > > > > > > > > KERNEL: 6.4.0
> > > > > > > > > > > > CPUS: 8
> > > > > > > > > > > >
> > > > > > > > > > > > nfs_v3: 727 tests, 12 failures, 569 skipped, 14863 seconds
> > > > > > > > > > > > Failures: generic/053 generic/099 generic/105 generic/124
> > > > > > > > > > > > generic/193 generic/258 generic/294 generic/318 generic/319
> > > > > > > > > > > > generic/444 generic/528 generic/529
> > > > > > > > > > > > nfs_default: 727 tests, 18 failures, 452 skipped, 21899 seconds
> > > > > > > > > > > > Failures: generic/053 generic/099 generic/105 generic/186
> > > > > > > > > > > > generic/187 generic/193 generic/294 generic/318 generic/319
> > > > > > > > > > > > generic/357 generic/444 generic/486 generic/513 generic/528
> > > > > > > > > > > > generic/529 generic/578 generic/675 generic/688
> > > > > > > > > > > > Totals: 1454 tests, 1021 skipped, 30 failures, 0 errors, 35096s
> > > > > > > > > > > >
> > > > > > > > > > > > KERNEL: 6.5.0-rc6-g4853c74bd7ab
> > > > > > > > > > > > CPUS: 8
> > > > > > > > > > > >
> > > > > > > > > > > > nfs_v3: 727 tests, 9 failures, 570 skipped, 14775 seconds
> > > > > > > > > > > > Failures: generic/053 generic/099 generic/105 generic/258
> > > > > > > > > > > > generic/294 generic/318 generic/319 generic/444 generic/529
> > > > > > > > > > > > nfs_default: 727 tests, 16 failures, 453 skipped, 22326 seconds
> > > > > > > > > > > > Failures: generic/053 generic/099 generic/105 generic/186
> > > > > > > > > > > > generic/187 generic/294 generic/318 generic/319 generic/357
> > > > > > > > > > > > generic/444 generic/486 generic/513 generic/529 generic/578
> > > > > > > > > > > > generic/675 generic/688
> > > > > > > > > > > > Totals: 1454 tests, 1023 skipped, 25 failures, 0 errors, 35396s
> > > > > > > > > > > >
> > > > > > > > > > > > KERNEL: 6.5.0-rc6-next-20230816-gef66bf8aeb91
> > > > > > > > > > > > CPUS: 8
> > > > > > > > > > > >
> > > > > > > > > > > > nfs_v3: 727 tests, 9 failures, 570 skipped, 14657 seconds
> > > > > > > > > > > > Failures: generic/053 generic/099 generic/105 generic/258
> > > > > > > > > > > > generic/294 generic/318 generic/319 generic/444 generic/529
> > > > > > > > > > > > nfs_default: 727 tests, 18 failures, 453 skipped, 21757 seconds
> > > > > > > > > > > > Failures: generic/053 generic/099 generic/105 generic/186
> > > > > > > > > > > > generic/187 generic/294 generic/318 generic/319 generic/357
> > > > > > > > > > > > generic/444 generic/486 generic/513 generic/529 generic/578
> > > > > > > > > > > > generic/675 generic/683 generic/684 generic/688
> > > > > > > > > > > > Totals: 1454 tests, 1023 skipped, 27 failures, 0 errors, 34870s
> > > > > > > > > As long as we're sharing results ... here is what I'm seeing with a
> > > > > > > > > 6.5-rc6 client & server:
> > > > > > > > >
> > > > > > > > > anna@gouda ~ % xfstestsdb xunit list --results --runid 1741
> > > > > > > > > --color=none
> > > > > > > > > +------+----------------------+---------+----------+------+------+------+-------+
> > > > > > > > > | run  | device               | xunit   | hostname | pass | fail | skip | time  |
> > > > > > > > > +------+----------------------+---------+----------+------+------+------+-------+
> > > > > > > > > | 1741 | server:/srv/xfs/test | tcp-3   | client   | 125  | 4    | 464  | 447 s |
> > > > > > > > > | 1741 | server:/srv/xfs/test | tcp-4.0 | client   | 117  | 11   | 465  | 478 s |
> > > > > > > > > | 1741 | server:/srv/xfs/test | tcp-4.1 | client   | 119  | 12   | 462  | 404 s |
> > > > > > > > > | 1741 | server:/srv/xfs/test | tcp-4.2 | client   | 212  | 18   | 363  | 564 s |
> > > > > > > > > +------+----------------------+---------+----------+------+------+------+-------+
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > anna@gouda ~ % xfstestsdb show --failure 1741 --color=none
> > > > > > > > > +-------------+---------+---------+---------+---------+
> > > > > > > > > | testcase    | tcp-3   | tcp-4.0 | tcp-4.1 | tcp-4.2 |
> > > > > > > > > +-------------+---------+---------+---------+---------+
> > > > > > > > > | generic/053 | passed  | failure | failure | failure |
> > > > > > > > > | generic/099 | passed  | failure | failure | failure |
> > > > > > > > > | generic/105 | passed  | failure | failure | failure |
> > > > > > > > > | generic/140 | skipped | skipped | skipped | failure |
> > > > > > > > > | generic/188 | skipped | skipped | skipped | failure |
> > > > > > > > > | generic/258 | failure | passed  | passed  | failure |
> > > > > > > > > | generic/294 | failure | failure | failure | failure |
> > > > > > > > > | generic/318 | passed  | failure | failure | failure |
> > > > > > > > > | generic/319 | passed  | failure | failure | failure |
> > > > > > > > > | generic/357 | skipped | skipped | skipped | failure |
> > > > > > > > > | generic/444 | failure | failure | failure | failure |
> > > > > > > > > | generic/465 | passed  | failure | failure | failure |
> > > > > > > > > | generic/513 | skipped | skipped | skipped | failure |
> > > > > > > > > | generic/529 | passed  | failure | failure | failure |
> > > > > > > > > | generic/604 | passed  | passed  | failure | passed  |
> > > > > > > > > | generic/675 | skipped | skipped | skipped | failure |
> > > > > > > > > | generic/688 | skipped | skipped | skipped | failure |
> > > > > > > > > | generic/697 | passed  | failure | failure | failure |
> > > > > > > > > | nfs/002     | failure | failure | failure | failure |
> > > > > > > > > +-------------+---------+---------+---------+---------+
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > > With NFSv4.2, v6.4.0 has 2 extra failures that the current
> > > > > > > > > > > > mainline
> > > > > > > > > > > > kernel doesn't:
> > > > > > > > > > > >
> > > > > > > > > > > > generic/193 (some sort of setattr problem)
> > > > > > > > > > > > generic/528 (known problem with btime handling in client
> > > > > > > > > > > > that has been fixed)
> > > > > > > > > > > >
> > > > > > > > > > > > While I haven't investigated, I'm assuming the 193 bug is also
> > > > > > > > > > > > something
> > > > > > > > > > > > that has been fixed in recent kernels. There are also 3 other
> > > > > > > > > > > > NFSv3
> > > > > > > > > > > > tests that started passing since v6.4.0. I haven't looked into
> > > > > > > > > > > > those.
> > > > > > > > > > > >
> > > > > > > > > > > > With the linux-next kernel there are 2 new regressions:
> > > > > > > > > > > >
> > > > > > > > > > > > generic/683
> > > > > > > > > > > > generic/684
> > > > > > > > > > > >
> > > > > > > > > > > > Both of these look like problems with setuid/setgid stripping,
> > > > > > > > > > > > and still
> > > > > > > > > > > > need to be investigated. I have more verbose result info on
> > > > > > > > > > > > the test
> > > > > > > > > > > > failures if anyone is interested.
> > > > > > > > > Interesting that I'm not seeing the 683 & 684 failures. What type of
> > > > > > > > > filesystem is your server exporting?
> > > > > > > > >
> > > > > > > > btrfs
> > > > > > > >
> > > > > > > > You are testing linux-next? I need to go back and confirm these
> > > > > > > > results
> > > > > > > > too.
> > > > > > > IMO linux-next is quite important : we keep hitting bugs that
> > > > > > > appear only after integration -- block and network changes in
> > > > > > > other trees especially can impact the NFS drivers.
> > > > > > >
> > > > > > Indeed, I suspect this is probably something from the vfs tree (though
> > > > > > we definitely need to confirm that). Today I'm testing:
> > > > > >
> > > > > > 6.5.0-rc6-next-20230817-g47762f086974
> > > > > >
> > > > > Nope, I was wrong. I ran a bisect and it landed here. I confirmed it by
> > > > > turning off leases on the nfs server and the test started passing. I
> > > > > probably won't have the cycles to chase this down further.
> > > > >
> > > > > The capture looks something like this:
> > > > >
> > > > > OPEN (get a write delegation
> > > > > WRITE
> > > > > CLOSE
> > > > > SETATTR (mode 06666)
> > > > >
> > > > > ...then presumably a task on the client opens the file again, but the
> > > > > setuid bits don't get stripped.
>
> OPEN (get a write delegation
> WRITE
> CLOSE
> SETATTR (mode 06666)
>
> The client continues with:
>
> (ALLOCATE,GETATTR) <<=== this is when the server stripped the SUID and SGID bit
> READDIR ====> file mode shows 0666 (SUID & SGID were stripped)
> READDIR ====> file mode shows 0666 (SUID & SGID were stripped)
> DELERETURN
>
> Here is stack trace of ALLOCATE when the SUID & SGID were stripped:
>
> **** start of notify_change, notice the i_mode bits, SUID & SGID were set:
> [notify_change]: d_iname[a] ia_valid[0x1a00] ia_mode[0x0] i_mode[0x8db6] [nfsd:2409:Mon Aug 21 23:05:31 2023]
> KILL[0] KILL_SUID[1] KILL_SGID[1]
>
> **** end of notify_change, notice the i_mode bits, SUID & SGID were stripped:
> [notify_change]: RET[0] d_iname[a] ia_valid[0x1a01] ia_mode[0x81b6] i_mode[0x81b6] [nfsd:2409:Mon Aug 21 23:05:31 2023]
>
> **** stack trace of notify_change comes from ALLOCATE:
> Returning from: 0xffffffffb726e764 : notify_change+0x4/0x500 [kernel]
> Returning to  : 0xffffffffb726bf99 : __file_remove_privs+0x119/0x170 [kernel]
>  0xffffffffb726cfad : file_modified_flags+0x4d/0x110 [kernel]
>  0xffffffffc0a2330b : xfs_file_fallocate+0xfb/0x490 [xfs]
>  0xffffffffb723e7d8 : vfs_fallocate+0x158/0x380 [kernel]
>  0xffffffffc0ddc30a : nfsd4_vfs_fallocate+0x4a/0x70 [nfsd]
>  0xffffffffc0def7f2 : nfsd4_allocate+0x72/0xc0 [nfsd]
>  0xffffffffc0df2663 : nfsd4_proc_compound+0x3d3/0x730 [nfsd]
>  0xffffffffc0dd633b : nfsd_dispatch+0xab/0x1d0 [nfsd]
>  0xffffffffc0bda476 : svc_process_common+0x306/0x6e0 [sunrpc]
>  0xffffffffc0bdb081 : svc_process+0x131/0x180 [sunrpc]
>  0xffffffffc0dd4864 : nfsd+0x84/0xd0 [nfsd]
>  0xffffffffb6f0bfd6 : kthread+0xe6/0x120 [kernel]
>  0xffffffffb6e587d4 : ret_from_fork+0x34/0x50 [kernel]
>  0xffffffffb6e03a3b : ret_from_fork_asm+0x1b/0x30 [kernel]
>
> I think the problem here is that the client does not update the file
> attribute after ALLOCATE. The GETATTR in the ALLOCATE compound does
> not include the mode bits.
>

Oh, interesting! Have you tried adding the FATTR4_MODE to that GETATTR
call on the client? Does it also fix this?

> The READDIR's reply show the test file's mode has the SUID & SGID bit
> stripped (0666) but apparently these were not used to update the file
> attribute.
>
> The test passes when server does not grant write delegation because:
>
> OPEN
> WRITE
> CLOSE
> SETATTR (06666)
> OPEN (CLAIM_FH, NOCREATE)
> ALLOCATE <<=== server clear SUID & SGID
> GETATTR, CLOSE <<=== GETATTR has mode bit as 0666, client updates file attribute
> READDIR
> READDIR
>
> As expected, if the server recalls the write delegation when SETATTR
> with SUID/SGID set then the test passes. This is because it forces the
> client to send the 2nd OPEN with CLAIM_FH, NOCREATE and then the
> (GETATTR, CLOSE) which cause the client to update the file attribute.
>

What's your sense of the best way to fix this? The stripping of mode
bits isn't covered by the NFSv4 spec, so this will ultimately come down
to a judgment call.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
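
For anyone reading the notify_change trace quoted above, the two i_mode values can be decoded with a few lines of Python. This is only an illustrative sketch and not something posted in the thread: the helper name describe_mode is made up for this example, and the two hex values are copied from the trace. It shows that 0x8db6 is a regular file with mode 06666 (SUID and SGID set) and that 0x81b6 is the same file with mode 0666 (both bits cleared).

```python
import stat


def describe_mode(i_mode: int) -> dict:
    """Split a raw i_mode value into file type, permission bits, and SUID/SGID flags.

    describe_mode is a hypothetical helper written for this illustration only.
    """
    return {
        "regular_file": stat.S_ISREG(i_mode),
        "perms": oct(stat.S_IMODE(i_mode)),   # permission bits incl. SUID/SGID/sticky
        "suid": bool(i_mode & stat.S_ISUID),  # 04000
        "sgid": bool(i_mode & stat.S_ISGID),  # 02000
    }


# Values copied from the notify_change trace in the quoted message.
before = describe_mode(0x8DB6)  # i_mode at the start of notify_change
after = describe_mode(0x81B6)   # i_mode at the end of notify_change

print("before:", before)
print("after: ", after)
```

The difference between the two values is exactly S_ISUID | S_ISGID (06000), which matches the KILL_SUID[1] KILL_SGID[1] flags reported in the trace.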