Re: [PATCH 6/6] RPC: adjust timeout for connect, bind, reestablish so that they are sensitive to the major timeout value

On 02/05/2010 11:05 PM, Batsakis, Alexandros wrote:

My replies are marked with the "AB" prefix.

-----Original Message-----
From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
Sent: Fri 2/5/2010 4:11 PM
To: Batsakis, Alexandros
Cc: Batsakis, Alexandros; linux-nfs@xxxxxxxxxxxxxxx; Myklebust, Trond
Subject: Re: [PATCH 6/6] RPC: adjust timeout for connect, bind,
reestablish so that they are sensitive to the major timeout value

On 02/05/2010 06:04 PM, Batsakis, Alexandros wrote:
 >
 >
 > On Feb 5, 2010, at 14:47, "Chuck Lever" <chuck.lever@xxxxxxxxxx> wrote:
 >
 >> On 02/05/2010 05:14 PM, Batsakis, Alexandros wrote:
 >>> Yeah sure,
 >>>
 >>> So imagine that for a specific connection the remaining major timeout
 >>> value is 30 secs. xs_connect has a default timeout of 60 secs before
 >>> attempting to reconnect. The user (NFS) expects to "hear back" from
 >>> the RPC layer within the timeout, because in many cases (e.g. lease
 >>> renewal) there is no benefit in an operation reaching the server at a
 >>> later time and missing the critical window just because the transport
 >>> was sleeping for an arbitrary amount of time.

Maybe you want RPC_TASK_SOFTCONN for NFSv4 renewals instead of
RPC_TASK_SOFT. This would cause the RENEW request to fail immediately
if the transport can't connect.

AB: Is this a new flag? I am not familiar with it. Or are you proposing
to add such a flag? It's not an unreasonable thing to do.

The flag was added recently (maybe in 2.6.33-rc?). It causes an individual RPC request to fail immediately if the underlying transport cannot be connected. It bypasses the reconnect timeout if the transport is not already connected.
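
Something like this, just to illustrate (an untested sketch; clnt, msg,
and renew_ops are assumed to exist in the caller):

	struct rpc_task_setup task_setup_data = {
		.rpc_client   = clnt,       /* assumed: rpc_clnt for this server */
		.rpc_message  = &msg,       /* assumed: prepared RENEW message */
		.callback_ops = &renew_ops, /* assumed: async completion ops */
		/* SOFTCONN makes the task fail right away if the transport
		 * cannot be connected, instead of waiting out the
		 * reconnect backoff. */
		.flags = RPC_TASK_ASYNC | RPC_TASK_SOFT | RPC_TASK_SOFTCONN,
	};
	struct rpc_task *task;

	task = rpc_run_task(&task_setup_data);
	if (!IS_ERR(task))
		rpc_put_task(task);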

Can you describe the scenario more precisely? If I wanted to reproduce
this here, how would I do it?

AB: Mount a 4.1 server (or a v4 server with state, e.g. an open/lock).
Then kill the server. With a lease time of 90 seconds, the client's
reconnect attempts will come roughly at 60-80 secs and at 120-140 secs,
so the 90-sec lease period is lost.

That doesn't sound right to me. xs_connect is supposed to retry after 3 seconds and then back off exponentially until it is retrying once every 5 minutes. How do you determine when the client is retrying the connect: by watching for SYN packets or by enabling dprintk messages in xprtsock?

 >> There are good reasons why there is a timeout before reestablishing a
 >> connection. Have you tested this patch with NFSv3 servers that are
 >> going up and down repeatedly, for example? I think skipping the
 >> reconnect delay could have consequences for the cases which the
 >> original xs_connect logic is supposed to address, and it's not
 >> something we want in many other cases.
 >>
 >
 > I am not skipping the reconnect delay. What I am saying is that it
 > seems wrong to me to hard-code the reconnection delay. Why 60 secs,
 > for example? To me it seems that this value should be related to the
 > timeout. Do you disagree?

Hrm. It looks like the TCP re-establish timeout starts at 3 seconds,
and then backs off to 5 minutes. So, it's not fixed, it's related to
how quickly the server responds to the client's connect() attempt.

AB: Agreed. All I am saying is let's cap this value.

It's capped today at 5 minutes. See XS_TCP_MAX_REEST_TO. Perhaps it could be lowered to 60 seconds.
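
For reference, the relevant logic in xs_connect looks roughly like this
(paraphrased from net/sunrpc/xprtsock.c):

	#define XS_TCP_INIT_REEST_TO	(3U * HZ)
	#define XS_TCP_MAX_REEST_TO	(5U * 60 * HZ)

	/* A socket existed before, so wait out the current reestablish
	 * timeout before reconnecting, then double it for next time,
	 * capped at XS_TCP_MAX_REEST_TO. */
	queue_delayed_work(rpciod_workqueue, &transport->connect_worker,
			   xprt->reestablish_timeout);
	xprt->reestablish_timeout <<= 1;
	if (xprt->reestablish_timeout > XS_TCP_MAX_REEST_TO)
		xprt->reestablish_timeout = XS_TCP_MAX_REEST_TO;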

But I don't yet see a need to link the reconnect timeout to the retransmit timeout. In nearly every case, a major timeout should occur if the client can't reconnect, and there's no harm done if that's delayed longer than is optimal.

RENEW is really quite the exception here. This procedure is the only one I can think of where the request has to start _and_ finish within a particular time limit.

I don't know as much about NFSv4 and especially RENEW as I should, but it seems to me there are a number of reasons why RENEW should perhaps be given its own transport.

1. If a RENEW is sent every 90 seconds, that means there's no possibility for the underlying transports to idle out.

2. We only need one RENEW per client-server pair, I think. Isn't it tied to the client ID?

3. Because of its timing requirements, we can't depend on a RENEW getting through a heavily loaded rpc_clnt at a given time.

4. It seems to want different retransmit behavior from any other procedure.

So perhaps the best solution is for the client to set up a separate rpc_clnt/rpc_xprt for each mounted server (but not each mount point) that is dedicated to handling RENEW requests for that server. It would have retransmit timeouts that are determined by the lease timer, and not by the retransmit timeout specified on the mount command line.

This works around the fact that TCP streams suffer from head-of-queue blocking, as well as attempting to work around a long RPC client backlog queue.

The only downside I can think of is this might consume excess privileged ports on a client that mounts hundreds of different servers.
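
Just to sketch the dedicated-transport idea above (hypothetical code;
the address, hostname, and lease-derived timeout policy are all
assumptions, not an existing interface):

	/* Hypothetical: derive retransmit timing from the lease time
	 * instead of the mount's timeo/retrans options. */
	struct rpc_timeout renew_timeout = {
		.to_initval = lease_time * HZ / 3,	/* assumed policy */
		.to_maxval  = lease_time * HZ / 3,
		.to_retries = 2,
	};
	struct rpc_create_args args = {
		.protocol   = XPRT_TRANSPORT_TCP,
		.address    = (struct sockaddr *)&server_addr,	/* assumed */
		.addrsize   = sizeof(server_addr),
		.timeout    = &renew_timeout,
		.servername = hostname,				/* assumed */
		.program    = &nfs_program,
		.version    = 4,
		.authflavor = RPC_AUTH_UNIX,
	};
	struct rpc_clnt *renew_clnt = rpc_create(&args);

	if (IS_ERR(renew_clnt))
		return PTR_ERR(renew_clnt);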

 >> Perhaps a better idea would be to mark these particular RPCs with some
 >> kind of indication that the RPC client has to connect immediately for
 >> this request, if possible. Similar to RPC_TASK_SOFTCONN.
 >
 >> In general, sunrpc.ko has a problem with this kind of "urgent" RPC
 >> request. If the RPC backlog queue is large for a particular rpc_clnt,
 >> it can often take many seconds (like longer than the major timeout)
 >> for a request to actually get down to the transport and get sent. I
 >> don't see that these timeout changes necessarily address that at all.

AB: Again, here I think that the correct solution would be to increase
the timeout value rather than ignore it. What is the purpose of the
timeout, then?

 > But if the timeout has expired, rpc_execute will quit the task anyway.
 > What is the downside of waking up and polling the server at intervals
 > that actually make some sense to the client, instead of sleeping for a
 > long, arbitrary period?

If the 5 minute backoff maximum is too long, then you can easily reduce
that maximum (which is probably not unreasonable). We originally had
the client retrying to connect every 3 seconds, but it was thought that
would put unnecessary load on servers.

AB: Again, my argument is that we shouldn't select these values
arbitrarily. All in all, I see your points. If I understand correctly,
you don't consider the TCP reconnection policy I am describing an
important issue. I assume you are OK with the timeout variability in
the state machine?

I think changing the timeout logic in both places is probably overkill, and may even have negative effects on nonidempotent requests. I think it's wise to depend on retransmit settings as little as possible, as retransmitting an RPC is basically a hack, when you get right down to it.

I'm still open to discussion, though.

--
chuck[dot]lever[at]oracle[dot]com