On Fri, Jun 14, 2024 at 05:46:21PM +0000, Chuck Lever III wrote:
> > Reservation means another node has an active reservation on that LU.
>
> There are only two accessors of the LUN: the NFS server and
> the NFS client running the test. That's why these errors are
> a little surprising to me.

You can create registrations from userspace, and some cluster managers
do that. But none of that should happen for a default setup.

> > When pNFS layout access fails we fall back to normal access through the
> > MDS, so this is expected.
>
> Expected, OK. From a usability standpoint, error messages like
> this would probably be alarming to administrators. I plan to
> convert the printk's and dprintk's in the NFSD layout code into
> trace points, but that doesn't help the messages emitted by the
> block and SCSI drivers. Ideally this should be less noisy.

Well, they really should be alarming, because the admin configured a
block layout setup and it did not work as expected. So it should ring
alarm bells.

> > Is generic/069 that first test that failed when doing a full xfstests
> > run?
>
> Yes, it's a full run. generic/069 is the first test where there
> are remarkable system journal messages (ie, PR errors), though
> there are a few subsequent tests that are also whinging.

Interesting. Normally only the server actually reserves the LU; the
clients just register. So something went wrong here, and only for
these tests.

> > Do you see LAYOUT* ops in /proc/self/mountstats for the previous
> > tests?
>
> generic/013 is known to generate layout recalls, for example,
> so there is layout activity during the test run.

Ok. The other thing would be to run blktrace on the client and check
that it shows I/O. But all this sounds like the tests work in general,
and something is up with generic/069.

generic/069 just does O_APPEND writes, so I can't see what would be so
special about it.

> I can go back and try reproducing with just generic/069 and
> tcpdump as a first step. Is there a way I can tell that the
> PR errors are not reporting a possible data corruption?

xfstests in general does data verification to check for data integrity,
so we should not rely on kernel messages.

I'm a bit busy right now, but I'll try to reproduce this locally next
week.
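
For the mountstats check mentioned above, a rough and untested sketch like the one below could sum the LAYOUT* op counts. It assumes the per-op statistics lines in /proc/self/mountstats look like "LAYOUTGET: <ops> <further counters>...", with the first number being the count of operations issued. The name layout_op_counts is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: optional leading whitespace, an op name starting
# with LAYOUT, a colon, then counters. The first counter is taken to be
# the number of operations issued.
_OP_LINE = re.compile(r"^\s*(LAYOUT\w*):\s+(\d+)")


def layout_op_counts(mountstats_text):
    """Sum the first counter of every LAYOUT* line across all mounts.

    Hypothetical helper: takes the text of /proc/self/mountstats and
    returns a dict mapping op name to the summed op count.
    """
    counts = Counter()
    for line in mountstats_text.splitlines():
        match = _OP_LINE.match(line)
        if match:
            counts[match.group(1)] += int(match.group(2))
    return dict(counts)
```

Running it on the contents of /proc/self/mountstats before and after a single test and comparing the two results would show whether that test issued any layout operations at all. If the counts do not move, the I/O went through the MDS instead.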