On 2 Nov 2016, at 19:34, NeilBrown wrote: > On Thu, Nov 03 2016, Benjamin Coddington wrote: > >> On 13 Oct 2016, at 0:26, NeilBrown wrote: > >>> @@ -3694,20 +3695,17 @@ nfs4_proc_setattr(struct dentry *dentry, >>> struct nfs_fattr *fattr, >>> >>> /* Search for an existing open(O_WRITE) file */ >>> if (sattr->ia_valid & ATTR_FILE) { >>> - struct nfs_open_context *ctx; >>> >>> ctx = nfs_file_open_context(sattr->ia_file); >>> - if (ctx) { >>> + if (ctx) >>> cred = ctx->cred; >>> - state = ctx->state; >>> - } >>> } >> >> >> Does this need a get_nfs_open_context() there to make sure the open >> context >> doesn't drop away? > > I can't see why you would. The ia_file must hold a reference to the > ctx, and this code doesn't keep any reference to the ctx after > nfs4_proc_setattr completes - does it? I had a thought that the file pointer could be shared, and just checking to see if the file's private data was set didn't mean that this task set it. I'm probably wrong about that, and I'm definitely wrong about the fix, since it will crash taking a reference. It just took much longer this time. >> I'm getting this on generic/089: >> >> [ 651.855291] run fstests generic/089 at 2016-11-01 11:15:57 >> [ 652.645828] NFS: nfs4_reclaim_open_state: Lock reclaim failed! >> [ 653.166259] NFSD: client ::1 testing state ID with incorrect client >> ID >> [ 653.167218] BUG: unable to handle kernel NULL pointer dereference at >> 0000000000000018 > > I think this BUG is happening in nfs41_check_expired_locks. > This: > list_for_each_entry(lsp, &state->lock_states, ls_locks) { > walks off the end of a list, finding a NULL on a list which should never > have a NULL pointer. That does suggest a use-after-free of an > nfs4_lock_state, or possibly of an nfs4_state. > > I can't see it in the code yet though. Yes, walking off the list, and I had some idea that we were getting duplicate fl_owners from a bad lock_context, but today that doesn't make any sense to me. I can't see how this can happen either. >> >> Something else is also wrong there.. wrapping that with >> get_nfs_open_context() makes the crash go away, but there are still >> several >> "NFS: nfs4_reclaim_open_state: Lock reclaim failed!" in the log. Why >> would >> we be doing reclaim at all? I'll look at a network capture next. > > The >> [ 653.166259] NFSD: client ::1 testing state ID with incorrect client ID > > errors suggests that the client is sending a stateid that doesn't match > the client id, so the server reports and error and the client enters > state recovery. > Maybe one thread is dropping a flock lock while another thread is using > it for some IO and they race? I think there are refcounts in place to > protect that but something might be missing. > > I look forward to seeing the network capture. The network capture isn't very enlightening.. or I haven't been able to find where the problem happens in the 800Mb of capture. I probably need to make the box panic on oops and capture externally so the problem is right at the end of the capture. I'll keep plugging away at this as I can. Ben -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html