On Wed, 2024-01-10 at 11:03 -0500, Anna Schumaker wrote:
> Hi Jeff,
>
> On Wed, Jan 10, 2024 at 9:30 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > I've been seeing some failures of generic/465 across all NFS versions
> > for a long time. I finally had some time to track down the cause, but
> > I'm not quite sure whether it's fixable.
> >
> > The test failures usually look like this (though often at a random
> > offset):
> >
> > SECTION -- default
> > FSTYP -- nfs
> > PLATFORM -- Linux/x86_64 kdevops-nfs-default 6.7.0-g2f76af849100 #80 SMP PREEMPT_DYNAMIC Wed Jan 10 06:33:59 EST 2024
> > MKFS_OPTIONS -- kdevops-nfsd:/export/kdevops-nfs-default-fstests-s
> > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 kdevops-nfsd:/export/kdevops-nfs-default-fstests-s /media/scratch
> >
> > generic/465 8s ... - output mismatch (see /data/fstests-install/xfstests/results/kdevops-nfs-default/6.7.0-g2f76af849100/default/generic/465.out.bad)
> > --- tests/generic/465.out 2024-01-10 06:39:53.500389434 -0500
> > +++ /data/fstests-install/xfstests/results/kdevops-nfs-default/6.7.0-g2f76af849100/default/generic/465.out.bad 2024-01-10 08:57:00.536146701 -0500
> > @@ -1,3 +1,4 @@
> > QA output created by 465
> > non-aio dio test
> > +encounter an error: block 117 offset 0, content 0
>
> Looking through my test history, I have this one mostly passing but
> with the occasional failure that looks like this.
>
> > aio-dio test
> > ...
> > (Run 'diff -u /data/fstests-install/xfstests/tests/generic/465.out /data/fstests-install/xfstests/results/kdevops-nfs-default/6.7.0-g2f76af849100/default/generic/465.out.bad' to see the entire diff)
> > Ran: generic/465
> > Failures: generic/465
> > Failed 1 of 1 tests
> >
> > The test kicks off a thread that tries to read the file using DIO while
> > the parent task writes 1M blocks of 'a' to it sequentially using DIO. It
> > expects that the reader will always see 'a' in the read result, or a
> > short read. In the above case, it got back a read with '\0' in it.
> >
> > The blocks in this test are 1M, so this block starts at offset
> > 122683392. Looking at a capture, I caught this:
> >
> > 65161 40.392338 192.168.122.173 → 192.168.122.227 NFS 3702 V4 Call WRITE StateID: 0x9e68 Offset: 123207680 Len: 524288 ; V4 Call READ_PLUS StateID: 0x9e68 Offset: 122683392 Len: 524288 ; V4 Call READ_PLUS StateID: 0x9e68 Offset: 123207680 Len: 524288
> > 65171 40.393230 192.168.122.173 → 192.168.122.227 NFS 3286 V4 Call WRITE StateID: 0x9e68 Offset: 122683392 Len: 524288
> > 65172 40.393401 192.168.122.227 → 192.168.122.173 NFS 182 V4 Reply (Call In 65161) WRITE
> > 65181 40.394844 192.168.122.227 → 192.168.122.173 NFS 6794 V4 Reply (Call In 65161) READ_PLUS
> > 65195 40.395506 192.168.122.227 → 192.168.122.173 NFS 6794 V4 Reply (Call In 65161) READ_PLUS
> >
> > It looks like the DIO writes got reordered here so the size of the file
> > probably increased briefly before the second write got processed, and
> > the read_plus got processed in between the two.
> >
> > While we might be able to force the client to send the WRITEs in order
> > of increasing offset in this situation, the server is under no
> > obligation to process concurrent RPCs in any particular order. I don't
> > think this is fundamentally fixable due to that.
> >
> > Am I wrong? If not, then I'll plan to send an fstests patch to skip this
> > test on NFS.
>
> I'm cool with this one being skipped. I'm assuming it passing in my
> setup is mostly accidental, which means it's not a very useful test.
>

That's my thinking too. It mostly passes, but occasionally the DIO reads
and writes race and things fail. We probably could fix this if we were
willing to serialize DIO reads vs. writes, but I think that would be too
costly.

I'll spin up a patch for xfstests sometime soon. Thanks!
--
Jeff Layton <jlayton@xxxxxxxxxx>
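
For reference, here is a rough C sketch of the read-vs-write pattern the
test exercises: a parent appending 1M blocks of 'a' with O_DIRECT while a
reader thread re-reads the same region with O_DIRECT and expects to see
only 'a' bytes or a short read. This is illustrative only and is not the
actual fstests helper; the file path, block count, and output format are
assumptions made for the example.

/*
 * Rough, illustrative sketch of the generic/465 access pattern (not the
 * actual fstests helper source).
 *
 * Build with something like: cc -O2 -o dio-race dio-race.c -lpthread
 * The path, block count, and message format below are assumptions.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	(1024 * 1024)
#define NBLKS	256

static const char *path = "/media/scratch/465-race";	/* assumed mount */

static void *reader(void *arg)
{
	char *buf;
	int fd = open(path, O_RDONLY | O_DIRECT);

	(void)arg;
	if (fd < 0 || posix_memalign((void **)&buf, 4096, BLKSZ))
		return NULL;

	for (off_t off = 0; off < (off_t)NBLKS * BLKSZ; ) {
		ssize_t ret = pread(fd, buf, BLKSZ, off);

		if (ret <= 0)		/* not written yet; retry this block */
			continue;
		for (ssize_t i = 0; i < ret; i++) {
			if (buf[i] != 'a') {
				printf("block %lld offset %zd, content %d\n",
				       (long long)(off / BLKSZ), i, buf[i]);
				exit(1);
			}
		}
		if (ret == BLKSZ)	/* full block verified, move on */
			off += BLKSZ;
	}
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t t;
	char *buf;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);

	if (fd < 0 || posix_memalign((void **)&buf, 4096, BLKSZ))
		return 1;
	memset(buf, 'a', BLKSZ);

	pthread_create(&t, NULL, reader, NULL);
	for (int i = 0; i < NBLKS; i++)		/* sequential 1M DIO writes */
		if (pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ) != BLKSZ)
			return 1;

	pthread_join(t, NULL);
	close(fd);
	return 0;
}

Pointed at an NFS mount, this can occasionally observe '\0' bytes for the
same reason as the capture above: the client may split a 1M DIO write into
multiple WRITEs, and a concurrent READ_PLUS can be processed between them.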