On 20 Jun 2024, at 10:34, Chuck Lever wrote: > On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote: >> On 19 Jun 2024, at 13:39, cel@xxxxxxxxxx wrote: >> >>> From: Chuck Lever <chuck.lever@xxxxxxxxxx> >>> >>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits >>> the following in the system journal: >>> >>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2) >>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3) >>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2) >>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3) >>> kernel: sd 6:0:0:1: reservation conflict >>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s >>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00 >>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2 >>> kernel: sd 6:0:0:1: reservation conflict >>> kernel: sd 6:0:0:1: reservation conflict >>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s >>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s >>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00 >>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00 >>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0 >>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0 >>> systemd[1]: fstests-generic-069.scope: Deactivated successfully. >>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time. >>> systemd[1]: media-test.mount: Deactivated successfully. >>> systemd[1]: media-scratch.mount: Deactivated successfully. >>> kernel: sd 6:0:0:1: reservation conflict >>> kernel: failed to unregister PR key. >>> >>> This appears to be due to a race. bl_alloc_lseg() calls this: >>> >>> 561 static struct nfs4_deviceid_node * >>> 562 bl_find_get_deviceid(struct nfs_server *server, >>> 563 const struct nfs4_deviceid *id, const struct cred *cred, >>> 564 gfp_t gfp_mask) >>> 565 { >>> 566 struct nfs4_deviceid_node *node; >>> 567 unsigned long start, end; >>> 568 >>> 569 retry: >>> 570 node = nfs4_find_get_deviceid(server, id, cred, gfp_mask); >>> 571 if (!node) >>> 572 return ERR_PTR(-ENODEV); >>> >>> nfs4_find_get_deviceid() does a lookup without the spin lock first. >>> If it can't find a matching deviceid, it creates a new device_info >>> (which calls bl_alloc_deviceid_node, and that registers the device's >>> PR key). >>> >>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again. >>> If it finds it this time, bl_find_get_deviceid() frees the spare >>> (new) device_info, which unregisters the PR key for the same device. >>> >>> Any subsequent I/O from this client on that device gets EBADE. >>> >>> The umount later unregisters the device's PR key again. >>> >>> To prevent this problem, register the PR key after the deviceid_node >>> lookup. >> >> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem >> after this patch, instead it just seems like it moves the race. What >> prevents another process waiting to take the nfs4_deviceid_lock from >> unregistering the same device? I think we need another way to signal >> bl_free_device that we don't want to unregister for the case where the new >> device isn't added to nfs4_deviceid_cache. > > That's a (related but) somewhat different issue. I haven't seen > that in practice so far. > > >> No good ideas yet - maybe we can use a flag set within the >> nfs4_deviceid_lock? > > Well this smells like a use for a reference count on the block > device, but fs/nfs doesn't control the definition of that data > structure. I think we need two things to fix this race: - a way to determine if the current thread is the one that added the device to the to the cache, if so do the register - a way to determine if we're freeing because we lost the race to the cache, if so don't un-register. We can get both with a flag that's always set within the nfs4_deviceid_lock, something like NFS_DEVICEID_INIT. If set, it signals we need to register in the case we keep the device, or skip de-registration in the case where we lost the race and throw it out. We still need this patch to do the registration after it lands in the cache. Ben