Re: [PATCH/RFC] Don't try to recover NFS locks when they are lost.

Chuck Lever <chuck.lever@xxxxxxxxxx> · Fri, 16 Aug 2013 13:13:44 -0400

On Aug 16, 2013, at 9:30 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:

> On Fri, 16 Aug 2013 20:38:21 +1000
> NeilBrown <neilb@xxxxxxx> wrote:
> 
>> On Thu, 15 Aug 2013 08:47:06 -0400 Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>> 
>>> On Thu, 15 Aug 2013 12:36:04 +1000
>>> NeilBrown <neilb@xxxxxxx> wrote:
>>> 
>>>> 
>>>> 
>>>> When an NFS (V4 specifically) client loses contact with the server it can
>>>> lose any locks that it holds.
>>>> Currently when it reconnects to the server it simply tries to reclaim
>>>> those locks.  This might succeed even though some other client has held and
>>>> released a lock in the mean time.  So the first client might think the file
>>>> is unchanged, but it isn't.  This isn't good.
>>>> 
>>>> If, when recovery happens, the locks cannot be claimed because some other
>>>> client still holds the lock, then  we get a message in the kernel logs, but
>>>> the client can still write.  So two clients can both think they have a lock
>>>> and can both write at the same time.  This is equally not good.
>>>> 
>>>> There was a patch a while ago
>>>>  http://comments.gmane.org/gmane.linux.nfs/41917
>>>> 
>>>> which tried to address some of this, but it didn't seem to go anywhere.
>>>> That patch would also send a signal to the process.  That might be useful
>>>> but I'm really just interested in failing the writes.
>>>> For NFSv4 (unlike v2/v3) there is a strong link between the lock and the
>>>> write request so we can fairly easily fail an IO of the lock is gone.
>>>> 
>>>> The patch below attempts to do this.  Does it make sense?
>>>> Because this is a fairly big change I introduces a module parameter
>>>> "recover_locks" which defaults to true (the current behaviour) but can be set
>>>> to "false" to tell the client not to try to recover things that were lost.
>>>> 
>>>> Comments?
>>>> 
>>>> Thanks,
>>>> NeilBrown
>>>> 
>>>> 
>>> 
>>> Failing a read or write when we can't recover a lock over the range
>>> seems reasonable to me. IIUC though, you're also saying that we
>>> shouldn't try to recover locks when the lease has expired? If so, then
>>> that seems wrong...
>>> 
>>> Isn't it the responsibility of the server to not allow a lock to be
>>> reclaimed when there has been a conflicting lock in the interim? It's
>>> quite possible (and even advantageous) for a server to hold onto a lock
>>> for a client that has missed renewing its lease when no other client has
>>> made a conflicting lock request.
>> 
>> Hi Jeff,
>> I had thought that too.  But when I looked I could find no evidence for it.
>> The only time a client can 'reclaim' a lock is during the grace period when
>> the server might have lost the lock due to a reboot.
>> The case I'm looking at is when neither host rebooted but there was a network
>> partition.
>> I think that if the server is to preserve the lock while no other client
>> contends it, it has to preserve the whole state and not return
>> NFS4ERR_EXPIRED.
>> Once the client gets NFS4ERR_EXPIRED it must assume that all related locks
>> may have been subject to conflicting locks from other clients.
>> 
>> Thanks,
>> NeilBrown
>> 
>> 
> 
> Ahh, a very good point. So I guess to reiterate, NFS4ERR_EXPIRED means
> that the state (if there was any) has been purged, and all bets are
> off. So yeah, ok...trying to reclaim locks at that point is probably
> wrong.
> 
> That said...why is the server granting those lock reclaims in this case?
> Presumably the grace period has passed and it shouldn't be granting any
> reclaim requests, right?

After NFS4ERR_EXPIRED, our client uses a normal open ( OPEN(CLAIM_NULL) ).  That is allowed any time the server is out of its grace period.

> 
> 
> 
>>> 
>>>> 
>>>> diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
>>>> index f5c84c3..de0229b 100644
>>>> --- a/fs/nfs/nfs3proc.c
>>>> +++ b/fs/nfs/nfs3proc.c
>>>> @@ -826,9 +826,10 @@ static void nfs3_proc_read_setup(struct nfs_read_data *data, struct rpc_message
>>>> 	msg->rpc_proc = &nfs3_procedures[NFS3PROC_READ];
>>>> }
>>>> 
>>>> -static void nfs3_proc_read_rpc_prepare(struct rpc_task *task, struct nfs_read_data *data)
>>>> +static int nfs3_proc_read_rpc_prepare(struct rpc_task *task, struct nfs_read_data *data)
>>>> {
>>>> 	rpc_call_start(task);
>>>> +	return 0;
>>>> }
>>>> 
>>>> static int nfs3_write_done(struct rpc_task *task, struct nfs_write_data *data)
>>>> @@ -847,9 +848,10 @@ static void nfs3_proc_write_setup(struct nfs_write_data *data, struct rpc_messag
>>>> 	msg->rpc_proc = &nfs3_procedures[NFS3PROC_WRITE];
>>>> }
>>>> 
>>>> -static void nfs3_proc_write_rpc_prepare(struct rpc_task *task, struct nfs_write_data *data)
>>>> +static int nfs3_proc_write_rpc_prepare(struct rpc_task *task, struct nfs_write_data *data)
>>>> {
>>>> 	rpc_call_start(task);
>>>> +	return 0;
>>>> }
>>>> 
>>>> static void nfs3_proc_commit_rpc_prepare(struct rpc_task *task, struct nfs_commit_data *data)
>>>> diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
>>>> index ee81e35..a468b345 100644
>>>> --- a/fs/nfs/nfs4_fs.h
>>>> +++ b/fs/nfs/nfs4_fs.h
>>>> @@ -135,6 +135,7 @@ struct nfs4_lock_state {
>>>> 	struct list_head	ls_locks;	/* Other lock stateids */
>>>> 	struct nfs4_state *	ls_state;	/* Pointer to open state */
>>>> #define NFS_LOCK_INITIALIZED 0
>>>> +#define NFS_LOCK_LOST        1
>>>> 	unsigned long		ls_flags;
>>>> 	struct nfs_seqid_counter	ls_seqid;
>>>> 	nfs4_stateid		ls_stateid;
>>>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>>>> index cf11799..bcbcd07 100644
>>>> --- a/fs/nfs/nfs4proc.c
>>>> +++ b/fs/nfs/nfs4proc.c
>>>> @@ -3907,15 +3907,19 @@ static void nfs4_proc_read_setup(struct nfs_read_data *data, struct rpc_message
>>>> 	nfs41_init_sequence(&data->args.seq_args, &data->res.seq_res, 0);
>>>> }
>>>> 
>>>> -static void nfs4_proc_read_rpc_prepare(struct rpc_task *task, struct nfs_read_data *data)
>>>> +static int nfs4_proc_read_rpc_prepare(struct rpc_task *task, struct nfs_read_data *data)
>>>> {
>>>> 	if (nfs4_setup_sequence(NFS_SERVER(data->header->inode),
>>>> 			&data->args.seq_args,
>>>> 			&data->res.seq_res,
>>>> 			task))
>>>> -		return;
>>>> -	nfs4_set_rw_stateid(&data->args.stateid, data->args.context,
>>>> -			data->args.lock_context, FMODE_READ);
>>>> +		return 0;
>>>> +	if (nfs4_set_rw_stateid(&data->args.stateid, data->args.context,
>>>> +				data->args.lock_context, FMODE_READ) == -EIO)
>>>> +		return -EIO;
>>>> +	if (unlikely(test_bit(NFS_CONTEXT_BAD, &data->args.context->flags)))
>>>> +		return -EIO;
>>>> +	return 0;
>>>> }
>>>> 
>>>> static int nfs4_write_done_cb(struct rpc_task *task, struct nfs_write_data *data)
>>>> @@ -3990,15 +3994,19 @@ static void nfs4_proc_write_setup(struct nfs_write_data *data, struct rpc_messag
>>>> 	nfs41_init_sequence(&data->args.seq_args, &data->res.seq_res, 1);
>>>> }
>>>> 
>>>> -static void nfs4_proc_write_rpc_prepare(struct rpc_task *task, struct nfs_write_data *data)
>>>> +static int nfs4_proc_write_rpc_prepare(struct rpc_task *task, struct nfs_write_data *data)
>>>> {
>>>> 	if (nfs4_setup_sequence(NFS_SERVER(data->header->inode),
>>>> 			&data->args.seq_args,
>>>> 			&data->res.seq_res,
>>>> 			task))
>>>> -		return;
>>>> -	nfs4_set_rw_stateid(&data->args.stateid, data->args.context,
>>>> -			data->args.lock_context, FMODE_WRITE);
>>>> +		return 0;
>>>> +	if (nfs4_set_rw_stateid(&data->args.stateid, data->args.context,
>>>> +				data->args.lock_context, FMODE_WRITE) == -EIO)
>>>> +		return -EIO;
>>>> +	if (unlikely(test_bit(NFS_CONTEXT_BAD, &data->args.context->flags)))
>>>> +		return -EIO;
>>>> +	return 0;
>>>> }
>>>> 
>>>> static void nfs4_proc_commit_rpc_prepare(struct rpc_task *task, struct nfs_commit_data *data)
>>>> @@ -5380,6 +5388,11 @@ static int nfs4_lock_reclaim(struct nfs4_state *state, struct file_lock *request
>>>> 	return err;
>>>> }
>>>> 
>>>> +bool recover_locks = true;
>>>> +module_param(recover_locks, bool, 0644);
>>>> +MODULE_PARM_DESC(recovery_locks,
>>>> +		 "If the server reports that a lock might be lost, "
>>>> +		 "try to recovery it risking corruption.");
>>>> static int nfs4_lock_expired(struct nfs4_state *state, struct file_lock *request)
>>>> {
>>>> 	struct nfs_server *server = NFS_SERVER(state->inode);
>>>> @@ -5391,6 +5404,10 @@ static int nfs4_lock_expired(struct nfs4_state *state, struct file_lock *request
>>>> 	err = nfs4_set_lock_state(state, request);
>>>> 	if (err != 0)
>>>> 		return err;
>>>> +	if (!recover_locks) {
>>>> +		set_bit(NFS_LOCK_LOST, &request->fl_u.nfs4_fl.owner->ls_flags);
>>>> +		return 0;
>>>> +	}
>>>> 	do {
>>>> 		if (test_bit(NFS_DELEGATED_STATE, &state->flags) != 0)
>>>> 			return 0;
>>>> diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
>>>> index e22862f..4d103ff 100644
>>>> --- a/fs/nfs/nfs4state.c
>>>> +++ b/fs/nfs/nfs4state.c
>>>> @@ -998,7 +998,9 @@ static int nfs4_copy_lock_stateid(nfs4_stateid *dst,
>>>> 	fl_pid = lockowner->l_pid;
>>>> 	spin_lock(&state->state_lock);
>>>> 	lsp = __nfs4_find_lock_state(state, fl_owner, fl_pid, NFS4_ANY_LOCK_TYPE);
>>>> -	if (lsp != NULL && test_bit(NFS_LOCK_INITIALIZED, &lsp->ls_flags) != 0) {
>>>> +	if (lsp && test_bit(NFS_LOCK_LOST, &lsp->ls_flags))
>>>> +		ret = -EIO;
>>>> +	else if (lsp != NULL && test_bit(NFS_LOCK_INITIALIZED, &lsp->ls_flags) != 0) {
>>>> 		nfs4_stateid_copy(dst, &lsp->ls_stateid);
>>>> 		ret = 0;
>>>> 		smp_rmb();
>>>> diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
>>>> index c041c41..a8f57c7 100644
>>>> --- a/fs/nfs/proc.c
>>>> +++ b/fs/nfs/proc.c
>>>> @@ -623,9 +623,10 @@ static void nfs_proc_read_setup(struct nfs_read_data *data, struct rpc_message *
>>>> 	msg->rpc_proc = &nfs_procedures[NFSPROC_READ];
>>>> }
>>>> 
>>>> -static void nfs_proc_read_rpc_prepare(struct rpc_task *task, struct nfs_read_data *data)
>>>> +static int nfs_proc_read_rpc_prepare(struct rpc_task *task, struct nfs_read_data *data)
>>>> {
>>>> 	rpc_call_start(task);
>>>> +	return 0;
>>>> }
>>>> 
>>>> static int nfs_write_done(struct rpc_task *task, struct nfs_write_data *data)
>>>> @@ -644,9 +645,10 @@ static void nfs_proc_write_setup(struct nfs_write_data *data, struct rpc_message
>>>> 	msg->rpc_proc = &nfs_procedures[NFSPROC_WRITE];
>>>> }
>>>> 
>>>> -static void nfs_proc_write_rpc_prepare(struct rpc_task *task, struct nfs_write_data *data)
>>>> +static int nfs_proc_write_rpc_prepare(struct rpc_task *task, struct nfs_write_data *data)
>>>> {
>>>> 	rpc_call_start(task);
>>>> +	return 0;
>>>> }
>>>> 
>>>> static void nfs_proc_commit_rpc_prepare(struct rpc_task *task, struct nfs_commit_data *data)
>>>> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
>>>> index 70a26c6..31db5c3 100644
>>>> --- a/fs/nfs/read.c
>>>> +++ b/fs/nfs/read.c
>>>> @@ -513,9 +513,10 @@ static void nfs_readpage_release_common(void *calldata)
>>>> void nfs_read_prepare(struct rpc_task *task, void *calldata)
>>>> {
>>>> 	struct nfs_read_data *data = calldata;
>>>> -	NFS_PROTO(data->header->inode)->read_rpc_prepare(task, data);
>>>> -	if (unlikely(test_bit(NFS_CONTEXT_BAD, &data->args.context->flags)))
>>>> -		rpc_exit(task, -EIO);
>>>> +	int err;
>>>> +	err = NFS_PROTO(data->header->inode)->read_rpc_prepare(task, data);
>>>> +	if (err)
>>>> +		rpc_exit(task, err);
>>>> }
>>>> 
>>>> static const struct rpc_call_ops nfs_read_common_ops = {
>>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>>>> index f1bdb72..7816801 100644
>>>> --- a/fs/nfs/write.c
>>>> +++ b/fs/nfs/write.c
>>>> @@ -1265,9 +1265,10 @@ EXPORT_SYMBOL_GPL(nfs_pageio_reset_write_mds);
>>>> void nfs_write_prepare(struct rpc_task *task, void *calldata)
>>>> {
>>>> 	struct nfs_write_data *data = calldata;
>>>> -	NFS_PROTO(data->header->inode)->write_rpc_prepare(task, data);
>>>> -	if (unlikely(test_bit(NFS_CONTEXT_BAD, &data->args.context->flags)))
>>>> -		rpc_exit(task, -EIO);
>>>> +	int err;
>>>> +	err = NFS_PROTO(data->header->inode)->write_rpc_prepare(task, data);
>>>> +	if (err)
>>>> +		rpc_exit(task, err);
>>>> }
>>>> 
>>>> void nfs_commit_prepare(struct rpc_task *task, void *calldata)
>>>> diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
>>>> index 8651574..c71e12b 100644
>>>> --- a/include/linux/nfs_xdr.h
>>>> +++ b/include/linux/nfs_xdr.h
>>>> @@ -1419,12 +1419,12 @@ struct nfs_rpc_ops {
>>>> 	void	(*read_setup)   (struct nfs_read_data *, struct rpc_message *);
>>>> 	void	(*read_pageio_init)(struct nfs_pageio_descriptor *, struct inode *,
>>>> 				    const struct nfs_pgio_completion_ops *);
>>>> -	void	(*read_rpc_prepare)(struct rpc_task *, struct nfs_read_data *);
>>>> +	int	(*read_rpc_prepare)(struct rpc_task *, struct nfs_read_data *);
>>>> 	int	(*read_done)  (struct rpc_task *, struct nfs_read_data *);
>>>> 	void	(*write_setup)  (struct nfs_write_data *, struct rpc_message *);
>>>> 	void	(*write_pageio_init)(struct nfs_pageio_descriptor *, struct inode *, int,
>>>> 				     const struct nfs_pgio_completion_ops *);
>>>> -	void	(*write_rpc_prepare)(struct rpc_task *, struct nfs_write_data *);
>>>> +	int	(*write_rpc_prepare)(struct rpc_task *, struct nfs_write_data *);
>>>> 	int	(*write_done)  (struct rpc_task *, struct nfs_write_data *);
>>>> 	void	(*commit_setup) (struct nfs_commit_data *, struct rpc_message *);
>>>> 	void	(*commit_rpc_prepare)(struct rpc_task *, struct nfs_commit_data *);
>>> 
>>> 
>> 
> 
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html