Re: [PATCH 3/6] NFSv4: change nfs4_do_setattr to take an open_context instead of a nfs4_state.

NeilBrown <neilb@xxxxxxxx> · Thu, 03 Nov 2016 10:34:56 +1100

On Thu, Nov 03 2016, Benjamin Coddington wrote:

> On 13 Oct 2016, at 0:26, NeilBrown wrote:

>> @@ -3694,20 +3695,17 @@ nfs4_proc_setattr(struct dentry *dentry, 
>> struct nfs_fattr *fattr,
>>
>>  	/* Search for an existing open(O_WRITE) file */
>>  	if (sattr->ia_valid & ATTR_FILE) {
>> -		struct nfs_open_context *ctx;
>>
>>  		ctx = nfs_file_open_context(sattr->ia_file);
>> -		if (ctx) {
>> +		if (ctx)
>>  			cred = ctx->cred;
>> -			state = ctx->state;
>> -		}
>>  	}
>
>
> Does this need a get_nfs_open_context() there to make sure the open 
> context
> doesn't drop away?

I can't see why you would.  The ia_file must hold a reference to the
ctx, and this code doesn't keep any reference to the ctx after
nfs4_proc_setattr completes - does it?

>     I'm getting this on generic/089:
>
> [  651.855291] run fstests generic/089 at 2016-11-01 11:15:57
> [  652.645828] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
> [  653.166259] NFSD: client ::1 testing state ID with incorrect client 
> ID
> [  653.167218] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000018

I think this BUG is happening in nfs41_check_expired_locks.
This:
	list_for_each_entry(lsp, &state->lock_states, ls_locks) {
walks off the end of a list, finding a NULL on a list which should never
have a NULL pointer.  That does suggest a use-after-free of an
nfs4_lock_state, or possibly of an nfs4_state.

I can't see it in the code yet though.

>
> Something else is also wrong there.. wrapping that with
> get_nfs_open_context() makes the crash go away, but there are still 
> several
> "NFS: nfs4_reclaim_open_state: Lock reclaim failed!" in the log.   Why 
> would
> we be doing reclaim at all?  I'll look at a network capture next.

The
> [  653.166259] NFSD: client ::1 testing state ID with incorrect client ID

errors suggests that the client is sending a stateid that doesn't match
the client id, so the server reports and error and the client enters
state recovery.
Maybe one thread is dropping a flock lock while another thread is using
it for some IO and they race?  I think there are refcounts in place to
protect that but something might be missing.

I look forward to seeing the network capture.

Thanks,
NeilBrown
Attachment:
signature.asc

Description: PGP signature