On Wed, 2016-06-08 at 12:10 -0400, Oleg Drokin wrote:
> On Jun 8, 2016, at 6:58 AM, Jeff Layton wrote:
>
> > A simple way to confirm that might be to convert all of the read locks
> > on the st_rwsem to write locks. That will serialize all of the open
> > operations and should prevent that particular race from occurring.
> >
> > If that works, we'd probably want to fix it in a less heavy-handed way,
> > but I'd have to think about how best to do that.
>
> So I looked at the call sites for nfs4_get_vfs_file(), how about something like this:
>
> after we grab the fp->fi_lock, we can do test_access(open->op_share_access, stp);
>
> If that returns true - just drop the spinlock and return EAGAIN.
>
> The callsite in nfs4_upgrade_open() would handle that by retesting the access map
> again and either coming back in or more likely reusing the now updated stateid
> (synchronised by the fi_lock again).
> We probably need to convert the whole access map testing there to be under
> fi_lock.
> Something like:
> nfs4_upgrade_open(struct svc_rqst *rqstp, struct nfs4_file *fp, struct svc_fh *cur_fh, struct nfs4_ol_stateid *stp, struct nfsd4_open *open)
> {
>         __be32 status;
>         unsigned char old_deny_bmap = stp->st_deny_bmap;
>
> again:
> +       spin_lock(&fp->fi_lock);
>         if (!test_access(open->op_share_access, stp)) {
> +               spin_unlock(&fp->fi_lock);
> +               status = nfs4_get_vfs_file(rqstp, fp, cur_fh, stp, open);
> +               if (status == -EAGAIN)
> +                       goto again;
> +               return status;
> +       }
>
>         /* test and set deny mode */
> -       spin_lock(&fp->fi_lock);
>         status = nfs4_file_check_deny(fp, open->op_share_deny);
>
>
> The call in nfsd4_process_open2() I think cannot hit this condition, right?
> probably can add a WARN_ON there? BUG_ON? more sensible approach?
>
> Alternatively we can probably always call nfs4_get_vfs_file() under this spinlock,
> just have it drop that for the open and then reobtain (already done), not as transparent I guess.
>

Yeah, I think that might be best.
It looks like things could change after you drop the spinlock with the
patch above. Since we have to retake it anyway in nfs4_get_vfs_file, we
can just do it there.

> Or the fi_lock might be converted to say a mutex, so we can sleep with it held and
> then we can hold it across whole invocation of nfs4_get_vfs_file() and access testing and stuff.

I think we'd be better off taking the st_rwsem for write (maybe just
turning it into a mutex). That would at least be per-stateid instead of
per-inode. That's a fine fix for now.

It might slow down a client slightly that is sending two stateid
morphing operations in parallel, but they shouldn't affect each other.
I'm liking that solution more and more here.

Longer term, I think we need to further simplify OPEN handling. It has
gotten better, but it's still really hard to follow currently (and is
obviously error-prone).

--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
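
To make the per-stateid serialization idea concrete, here is a rough
userspace sketch. It is only an illustration under loud assumptions: a
pthread mutex stands in for st_rwsem taken for write, and every toy_*
name is hypothetical and does not exist in nfsd. The point it tries to
show is that the access test and the bitmap update happen under one
lock that is never dropped in between, so a second opener cannot see a
stale access map.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct nfs4_ol_stateid (not the real layout). */
struct toy_stateid {
	pthread_mutex_t st_mutex;  /* assumed replacement for st_rwsem held for write */
	unsigned int access_bmap;  /* bit n set => share access mode n already granted */
	int opens_done;            /* counts how often the "open the file" step ran */
};

static void toy_stateid_init(struct toy_stateid *stp)
{
	pthread_mutex_init(&stp->st_mutex, NULL);
	stp->access_bmap = 0;
	stp->opens_done = 0;
}

/* Same shape as the test_access() check in the quoted patch: one bit per mode. */
static bool toy_test_access(unsigned int access, const struct toy_stateid *stp)
{
	return (stp->access_bmap & (1u << access)) != 0;
}

/*
 * Test-and-set under a single per-stateid lock. Because the lock is held
 * across both the test and the update, two callers asking for the same
 * access mode cannot both decide that the file still needs opening.
 */
static int toy_upgrade_open(struct toy_stateid *stp, unsigned int access)
{
	pthread_mutex_lock(&stp->st_mutex);
	if (!toy_test_access(access, stp)) {
		stp->opens_done++;                 /* placeholder for nfs4_get_vfs_file() */
		stp->access_bmap |= 1u << access;  /* record the newly granted mode */
	}
	pthread_mutex_unlock(&stp->st_mutex);
	return 0;
}
```

Calling toy_upgrade_open() twice for the same access mode on one
toy_stateid runs the open step only once, and operations on different
stateids never contend with each other, which is the per-stateid versus
per-inode difference mentioned above. A sleeping lock is assumed here
because the real open step may block.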