On Thu, Jun 20, 2024 at 12:45:56PM -0400, Benjamin Coddington wrote: > On 20 Jun 2024, at 11:56, Benjamin Coddington wrote: > > > On 20 Jun 2024, at 11:46, Chuck Lever wrote: > > > >> On Thu, Jun 20, 2024 at 11:30:54AM -0400, Benjamin Coddington wrote: > >>> On 20 Jun 2024, at 10:34, Chuck Lever wrote: > >>> > >>>> On Thu, Jun 20, 2024 at 09:51:46AM -0400, Benjamin Coddington wrote: > >>>>> On 19 Jun 2024, at 13:39, cel@xxxxxxxxxx wrote: > >>>>> > >>>>>> From: Chuck Lever <chuck.lever@xxxxxxxxxx> > >>>>>> > >>>>>> During generic/069 runs with pNFS SCSI layouts, the NFS client emits > >>>>>> the following in the system journal: > >>>>>> > >>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2) > >>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3) > >>>>>> kernel: pNFS: failed to open device /dev/disk/by-id/dm-uuid-mpath-0x6001405e3366f045b7949eb8e4540b51 (-2) > >>>>>> kernel: pNFS: using block device sdb (reservation key 0x666b60901e7b26b3) > >>>>>> kernel: sd 6:0:0:1: reservation conflict > >>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > >>>>>> kernel: sd 6:0:0:1: [sdb] tag#16 CDB: Write(10) 2a 00 00 00 00 50 00 00 08 00 > >>>>>> kernel: reservation conflict error, dev sdb, sector 80 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2 > >>>>>> kernel: sd 6:0:0:1: reservation conflict > >>>>>> kernel: sd 6:0:0:1: reservation conflict > >>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > >>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > >>>>>> kernel: sd 6:0:0:1: [sdb] tag#18 CDB: Write(10) 2a 00 00 00 00 60 00 00 08 00 > >>>>>> kernel: sd 6:0:0:1: [sdb] tag#17 CDB: Write(10) 2a 00 00 00 00 58 00 00 08 00 > >>>>>> kernel: reservation conflict error, dev sdb, sector 96 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0 > >>>>>> kernel: reservation conflict error, dev sdb, sector 88 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0 > >>>>>> systemd[1]: fstests-generic-069.scope: Deactivated successfully. > >>>>>> systemd[1]: fstests-generic-069.scope: Consumed 5.092s CPU time. > >>>>>> systemd[1]: media-test.mount: Deactivated successfully. > >>>>>> systemd[1]: media-scratch.mount: Deactivated successfully. > >>>>>> kernel: sd 6:0:0:1: reservation conflict > >>>>>> kernel: failed to unregister PR key. > >>>>>> > >>>>>> This appears to be due to a race. bl_alloc_lseg() calls this: > >>>>>> > >>>>>> 561 static struct nfs4_deviceid_node * > >>>>>> 562 bl_find_get_deviceid(struct nfs_server *server, > >>>>>> 563 const struct nfs4_deviceid *id, const struct cred *cred, > >>>>>> 564 gfp_t gfp_mask) > >>>>>> 565 { > >>>>>> 566 struct nfs4_deviceid_node *node; > >>>>>> 567 unsigned long start, end; > >>>>>> 568 > >>>>>> 569 retry: > >>>>>> 570 node = nfs4_find_get_deviceid(server, id, cred, gfp_mask); > >>>>>> 571 if (!node) > >>>>>> 572 return ERR_PTR(-ENODEV); > >>>>>> > >>>>>> nfs4_find_get_deviceid() does a lookup without the spin lock first. > >>>>>> If it can't find a matching deviceid, it creates a new device_info > >>>>>> (which calls bl_alloc_deviceid_node, and that registers the device's > >>>>>> PR key). > >>>>>> > >>>>>> Then it takes the nfs4_deviceid_lock and looks up the deviceid again. > >>>>>> If it finds it this time, bl_find_get_deviceid() frees the spare > >>>>>> (new) device_info, which unregisters the PR key for the same device. > >>>>>> > >>>>>> Any subsequent I/O from this client on that device gets EBADE. > >>>>>> > >>>>>> The umount later unregisters the device's PR key again. > >>>>>> > >>>>>> To prevent this problem, register the PR key after the deviceid_node > >>>>>> lookup. > >>>>> > >>>>> Hi Chuck - nice catch, but I'm not seeing how we don't have the same problem > >>>>> after this patch, instead it just seems like it moves the race. What > >>>>> prevents another process waiting to take the nfs4_deviceid_lock from > >>>>> unregistering the same device? I think we need another way to signal > >>>>> bl_free_device that we don't want to unregister for the case where the new > >>>>> device isn't added to nfs4_deviceid_cache. > >>>> > >>>> That's a (related but) somewhat different issue. I haven't seen > >>>> that in practice so far. > >>>> > >>>> > >>>>> No good ideas yet - maybe we can use a flag set within the > >>>>> nfs4_deviceid_lock? > >>>> > >>>> Well this smells like a use for a reference count on the block > >>>> device, but fs/nfs doesn't control the definition of that data > >>>> structure. > >>> > >>> I think we need two things to fix this race: > >>> - a way to determine if the current thread is the one > >>> that added the device to the to the cache, if so do the register > >>> - a way to determine if we're freeing because we lost the race to the > >>> cache, if so don't un-register. > >> > >> My patch is supposed to deal with all of that already. Can you show > >> me specifically what is not getting handled by my proposed change? > > > > Well - I may be missing something, but it looks like with this patch you can > > still have: > > > > Thread > > A B > > > > nfs4_find_get_deviceid > > new{a} = nfs4_get_device_info > > locks nfs4_deviceid_lock > > nfs4_find_get_deviceid > > new{b} = nfs4_get_device_info > > spins on nfs4_deviceid_lock > > adds new{a} to the cache > > unlocks nfs4_deviceid_lock > > pr_register > > locks nfs4_deviceid_lock > > finds new{a} > > pr_UNregisters new{b} > > > > In this case, you end up with an unregistered device. > > Oh jeez Chuck, nevermind - I am missing something, that we check the the > new{b} pnfs_block_dev->pr_registred before unregistering it. > > So, actually - I think this patch does solve the problem. I apologize for > the noise here. Thanks! Was wondering, because I thought that was exactly the race I was trying to fix! -- Chuck Lever