> On Jun 14, 2024, at 2:26 PM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jun 14, 2024 at 05:46:21PM +0000, Chuck Lever III wrote:
>>> Reservation means another node has an active reservation on that LU.
>>
>> There are only two accessors of the LUN: the NFS server and
>> the NFS client running the test. That's why these errors are
>> a little surprising to me.
>
> You can create registrations from userspace, and some cluster managers
> do that. But none of that should happen for a default setup.
>
>>> When pNFS layout access fails we fall back to normal access through the
>>> MDS, so this is expected.
>>
>> Expected, OK. From a usability standpoint, error messages like
>> this would probably be alarming to administrators. I plan to
>> convert the printk's and dprintk's in the NFSD layout code into
>> trace points, but that doesn't help the messages emitted by the
>> block and SCSI drivers. Ideally this should be less noisy.
>
> Well, they really should be alarming because the admin configured
> a block layout setup and it did not work as expected. So it should
> ring alarm bells.

Yes, I expect that "pNFS: failed to open device
/dev/disk/by-id/dm-uuid-mpath-0x6001 ..." is very likely
operator error.

>>> Is generic/069 that first test that failed when doing a full xfstests
>>> run?
>>
>> Yes, it's a full run. generic/069 is the first test where there
>> are remarkable system journal messages (ie, PR errors), though
>> there are a few subsequent tests that are also whinging.
>
> Interesting. Normally only the server actually reserves the LU,
> the clients just register. And something went wrong here and only
> for these tests.

I just checked the NFS server's system journal, and there's
nothing interesting there.

FWIW, the other two tests that emit unexpected journal
messages that I noted down are generic/108 and generic/616.

>>> Do you see LAYOUT* ops in /proc/self/mountstats for the previous
>>> tests?
>>
>> generic/013 is known to generate layout recalls, for example,
>> so there is layout activity during the test run.
>
> Ok. The other thing would be to run blktrace on the client and
> see that it shows I/O. But all this sounds like the tests in
> general work, but something is up with generic/069.
>
> generic/069 just does O_APPEND writes, so I can't see what
> would be so special about it.
>
>>
>> I can go back and try reproducing with just generic/069 and
>> tcpdump as a first step. Is there a way I can tell that the
>> PR errors are not reporting a possible data corruption?
>
> xfstests in general does data verification to check for data integrity,
> so we should not rely on kernel messages.
>
> I'm a bit busy right now, but I'll try to reproduce this locally next
> week.

Thanks, I'll also try to investigate further.

--
Chuck Lever
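
For the mountstats check discussed above, here is a rough sketch of
how the LAYOUT* counters could be pulled out on the client between
tests. It is not taken from xfstests or nfs-utils. The function name
layout_op_counts is made up for illustration, and it assumes the
usual mountstats layout: a "device ... mounted on ... with fstype ..."
header per mount, followed by per-op lines whose first numeric field
is the number of operations issued.

```python
import re


def layout_op_counts(mountstats_text):
    """Return {mount_point: {op_name: count}} for LAYOUT* ops on NFS mounts.

    Assumes the first number after "OPNAME:" is the count of issued ops.
    """
    counts = {}
    mount_point = None
    for line in mountstats_text.splitlines():
        # A header line starts the statistics for one mount.
        header = re.match(r"device \S+ mounted on (\S+) with fstype (\S+)", line)
        if header:
            # Only track NFS mounts; ignore per-op lines of anything else.
            is_nfs = header.group(2).startswith("nfs")
            mount_point = header.group(1) if is_nfs else None
            continue
        # Per-op lines look like "   LAYOUTGET: 12 12 0 ...".
        op = re.match(r"\s*(LAYOUT\w*):\s+(\d+)", line)
        if op and mount_point is not None:
            counts.setdefault(mount_point, {})[op.group(1)] = int(op.group(2))
    return counts
```

Feeding it the contents of /proc/self/mountstats before and after a
single test (generic/069, say) and comparing the two results would
show whether that test issued any layout operations at all, or
whether its I/O went through the MDS.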