On Tue, 2020-01-14 at 18:37 -0500, Olga Kornievskaia wrote:
> On Tue, Jan 14, 2020 at 3:52 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > On Tue, 2020-01-14 at 13:43 -0500, Olga Kornievskaia wrote:
> > > On Mon, Jan 13, 2020 at 4:51 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > On Mon, 2020-01-13 at 16:05 -0500, Olga Kornievskaia wrote:
> > > > > On Mon, Jan 13, 2020 at 1:24 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > On Mon, 2020-01-13 at 13:09 -0500, Olga Kornievskaia wrote:
> > > > > > > On Mon, Jan 13, 2020 at 11:49 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > On Mon, 2020-01-13 at 11:08 -0500, Olga Kornievskaia wrote:
> > > > > > > > > On Fri, Jan 10, 2020 at 4:03 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > On Fri, 2020-01-10 at 14:29 -0500, Olga Kornievskaia wrote:
> > > > > > > > > > > Hi folks,
> > > > > > > > > > >
> > > > > > > > > > > We are having an issue with interrupted RPCs again. Here's what I see when xfstests were ctrl-c-ed:
> > > > > > > > > > >
> > > > > > > > > > > frame 332: SETATTR call, slot=0, seqid=0x000013ca (I'm assuming this is interrupted and the slot released)
> > > > > > > > > > > frame 333: CLOSE call, slot=0, seqid=0x000013cb (the only way the slot could be free before the reply is if the SETATTR was interrupted, right? Otherwise we should never have the slot used by more than one outstanding RPC)
> > > > > > > > > > > frame 334: reply to 333 with SEQ_MISORDERED (I'm assuming the server received frame 333 before 332)
> > > > > > > > > > > frame 336: CLOSE call, slot=0, seqid=0x000013ca (??? why did we decrement it? I mean, I know why it's in the current code :-/ )
> > > > > > > > > > > frame 337: reply to 336, SEQUENCE with ERR_DELAY
> > > > > > > > > > > frame 339: reply to 332, SETATTR, which nobody is waiting for
> > > > > > > > > > > frame 543: CLOSE call, slot=0, seqid=0x000013ca (retry after waiting out the err_delay)
> > > > > > > > > > > frame 544: reply to 543 with SETATTR (out of the cache)
> > > > > > > > > > >
> > > > > > > > > > > What this leads to is: the file is never closed on the server, so it can't be removed, and unmount fails with CLID_BUSY.
> > > > > > > > > > >
> > > > > > > > > > > I believe that's the result of commit 3453d5708b33efe76f40eca1c0ed60923094b971.
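For reference, both the SEQ_MISORDERED in frame 334 and the cached SETATTR reply in frame 544 fall straight out of the RFC 5661 session-slot rules. A rough, illustrative model of the server-side check (made-up names, not the actual knfsd code):

#include <stdbool.h>
#include <stdint.h>

enum nfs_model_status {
        NFS4_OK                 = 0,
        NFS4ERR_DELAY           = 10008,
        NFS4ERR_SEQ_MISORDERED  = 10063,
        NFS4ERR_SEQ_FALSE_RETRY = 10089,
};

struct model_slot {
        uint32_t seqid;        /* seqid of the last request accepted on this slot */
        bool     have_reply;   /* the reply for that request has been cached */
        uint64_t args_digest;  /* only if the server also caches the arguments */
};

/*
 * Decide what to do with an incoming (slot, seqid) pair.  *replay is set
 * when the cached reply should be returned instead of executing the compound.
 */
static enum nfs_model_status
check_slot_seqid(struct model_slot *slot, uint32_t seqid,
                 uint64_t args_digest, bool *replay)
{
        *replay = false;

        if (seqid == slot->seqid + 1) {
                /* New request: accept it and advance the slot.  The reply
                 * gets cached once the compound completes. */
                slot->seqid = seqid;
                slot->have_reply = false;
                slot->args_digest = args_digest;
                return NFS4_OK;
        }

        if (seqid == slot->seqid) {
                /* Same seqid as the last accepted request: a retry. */
                if (!slot->have_reply)
                        /* Original still executing; the server may also
                         * choose to just wait (frame 337). */
                        return NFS4ERR_DELAY;
                /*
                 * A server that also cached the arguments could notice that
                 * this is a *different* compound and return
                 * NFS4ERR_SEQ_FALSE_RETRY here; that is the "argument
                 * caching" which, per the discussion below, neither the
                 * Linux nor the NetApp server implements.
                 */
                if (args_digest != slot->args_digest)
                        return NFS4ERR_SEQ_FALSE_RETRY;
                *replay = true;        /* frame 544 */
                return NFS4_OK;
        }

        /* Anything else, e.g. two ahead as in frame 333: misordered. */
        return NFS4ERR_SEQ_MISORDERED;
}

With frame 332 not yet seen by the server, its slot is still at 0x13c9, so the CLOSE at 0x13cb in frame 333 lands in the misordered branch; once the SETATTR has been processed the slot sits at 0x13ca, so the retried CLOSE at 0x13ca in frame 543 lands in the replay branch and is answered with the cached SETATTR reply.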
> > > > > > > > > > > We used to have code that bumped the sequence number up when the slot was interrupted, but that went away with the commit "NFSv4.1: Avoid false retries when RPC calls are interrupted".
> > > > > > > > > > >
> > > > > > > > > > > The commit message says: "The obvious fix is to bump the sequence number pre-emptively if an RPC call is interrupted, but in order to deal with the corner cases where the interrupted call is not actually received and processed by the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED as a sign that we need to either wait or locate a correct sequence number that lies between the value we sent, and the last value that was acked by a SEQUENCE call on that slot."
> > > > > > > > > > >
> > > > > > > > > > > If we can no longer just bump the sequence number up, I don't think the correct action is to automatically bump it down (as in the example here). The commit doesn't describe the corner case where it was necessary to bump the sequence up. I wonder if we can bring back the knowledge of the interrupted slot and make a decision based on that as well as on whatever the other corner case is.
> > > > > > > > > > >
> > > > > > > > > > > I guess what I'm getting at is: can somebody (Trond) provide the details of the corner case this patch was created for? Then I can see whether I can fix the "common" case, which is now broken, without breaking the corner case....
> > > > > > > > > >
> > > > > > > > > > There is no pure client-side solution for this problem.
> > > > > > > > > >
> > > > > > > > > > The change was made because, if you have multiple interruptions of the RPC call, then the client has to somehow figure out what the correct sequence number for the slot is.
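That search is the "start high, then step back down on NFS4ERR_SEQ_MISORDERED" behaviour visible in frames 333-543 of the trace and described in the next paragraph. A deliberately simplified sketch of the idea (illustrative only; the real logic lives in fs/nfs/nfs4proc.c and fs/nfs/nfs4session.c and differs in detail):

#include <stdint.h>

enum {
        NFS4_OK                = 0,
        NFS4ERR_DELAY          = 10008,
        NFS4ERR_SEQ_MISORDERED = 10063,
};

struct model_slot {
        uint32_t seq_nr;             /* seqid the client will send next */
        uint32_t seq_nr_last_acked;  /* last seqid the server acked on this slot */
};

/* Hypothetical hooks standing in for the RPC transport and state machine. */
extern int  model_send_on_slot(struct model_slot *slot, uint32_t seqid);
extern void model_wait_for_server(void);

/* The interrupted call may or may not have reached the server: assume it
 * did, and let the resync below walk us back down if it did not. */
static void model_slot_interrupted(struct model_slot *slot)
{
        slot->seq_nr++;
}

/* Reuse the slot after an interruption, searching for a seqid the server
 * will accept somewhere between what we last sent and what it last acked. */
static int model_slot_resync(struct model_slot *slot)
{
        uint32_t seqid = slot->seq_nr;
        int status;

        for (;;) {
                status = model_send_on_slot(slot, seqid);
                switch (status) {
                case NFS4ERR_DELAY:
                        /* The server is still executing an earlier request
                         * on this slot (frame 337): wait, retry same seqid. */
                        model_wait_for_server();
                        continue;
                case NFS4ERR_SEQ_MISORDERED:
                        /* Too far ahead (frame 334): step back towards the
                         * last value the server acked (frame 336). */
                        if (seqid == slot->seq_nr_last_acked + 1)
                                return status;  /* nothing left to try */
                        seqid--;
                        continue;
                default:
                        /* A normal reply, or, as in frame 544, a stale one
                         * served from the reply cache, which is the bug. */
                        slot->seq_nr = slot->seq_nr_last_acked = seqid;
                        return status;
                }
        }
}

The trace above is exactly this search going wrong: the step down lands on a seqid that the server has meanwhile consumed for the interrupted SETATTR, so the retried CLOSE is answered out of the reply cache.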
> > > > > > > > > > If the client starts low and then goes high, and the server is not caching the arguments for the RPC call whose reply is in the session cache, then we will _always_ hit this bug, because we will always hit the replay of the last entry.
> > > > > > > > > >
> > > > > > > > > > At least if we start high and iterate downwards, we reduce the problem to a race with the processing of the interrupted request, as it is in this case.
> > > > > > > > > >
> > > > > > > > > > However, as I said, the real solution here has to involve the server.
> > > > > > > > > OK, I see your point that if the server cached the arguments, then it could tell that the 2nd RPC using the same slot+seqid has different args, and it would not serve it from the replay cache.
> > > > > > > > >
> > > > > > > > > However, I wonder if the client can do better. Can't we be more aware of when we are interrupting the RPC? For instance, if we are interrupted after we have started to wait on the RPC, doesn't that mean the RPC was sent on the network, and since the network is reliable, the server must have consumed the seqid for that slot (in which case, increment the seqid)? That's the case that's failing now.
> > > > > > > > "Reliable transport" does not mean that a client knows what got received and processed by the server and what didn't. All the client knows is that if the connection is still up, then the TCP layer will keep retrying transmission of the request. There are plenty of error scenarios where the client gets no information back as to whether or not the data was received by the server (e.g. due to lost ACKs).
> > > > > > > >
> > > > > > > > Furthermore, if an RPC call is interrupted on the client, either due to a timeout or a signal,
> > > > > > > What timeout are you referring to here, since a 4.1 RPC can't time out? I think that leaves only a signal.
> > > > > > If you use the 'soft' or 'softerr' mount options, then NFSv4.1 will time out when the server is being unresponsive.
> > > > > > That behaviour is different from the behaviour under a signal, but it has the same effect of interrupting the RPC call without us being able to know whether the server received the data.
> > > > > > > > then it almost always ends up breaking the connection in order to avoid corruption of the data stream (by interrupting the transmission before the entire RPC call has been sent). You generally have to be lucky to see the timeout/signal occur only when all the RPC calls being cancelled happen to fit exactly into the socket buffer.
> > > > > > > Wouldn't a retransmission (due to a connection reset, for whatever reason) be different, and not involve reprocessing of the slot?
> > > > > > I'm not talking about retransmissions here. I'm talking only about NFSv4.x RPC calls that suffer a fatal interruption (i.e. no retransmission).
> > > > > > > > Finally, just because the server's TCP layer ACKed receipt of the RPC call data, that does not mean that it will process that call. The connection could break before the call is read out of the receiving socket, or the server may later decide to drop it on the floor and break the connection.
> > > > > > > >
> > > > > > > > IOW: the RPC protocol here is not "reliable transport implies processing is guaranteed". It is rather "connection is still up implies processing may eventually occur".
> > > > > > > "Eventually occur" means that processing of the RPC is guaranteed "in time". Again, unless the client is broken, we can't have more than an interrupted RPC (that has nothing waiting on it) and the next RPC (both of which will be re-transmitted if the connection is dropped) going to the server.
> > > > > > >
> > > > > > > Can we distinguish between an interruption due to re-transmission and an interruption due to a ctrl-c of the thread? If we can't, then I'll stop arguing that the client can do better.
> > > > > > There is no "interrupted due to re-transmission" case. We only retransmit NFSv4 requests if the TCP connection breaks.
> > > > > >
> > > > > > As far as I'm concerned, this discussion is only about interruptions that cause the RPC call to be abandoned (i.e. fatal timeouts and signals).
> > > > > > > But right now we are left in a bad state.
> > > > > > > The client leaves open state on the server, and that will not allow files to be deleted. I think in the case where the "next rpc" is a write that will never be completed, it would leave the machine in a hung state. I just don't see how you can justify that the current code is any better than the solution that was there before.
> > > > > > That's a general problem with allowing interruptions, and it is largely orthogonal to the question of which strategy we choose when resynchronising the slot sequence numbers after an interruption has occurred.
> > > > > I'm re-reading the spec, and in section 2.10.6.2 we have "A requester MUST wait for a reply to a request before using the slot for another request". Are we even legally using the slot when we have an interrupted slot?
> > > > You can certainly argue that. However, the fact that the spec fails to address the issue doesn't imply a lack of need. I have workloads on my own systems that would cause major disruption if I did not allow them to time out when the server is unavailable (e.g. with memory filling up with dirty pages that can't be cleaned).
> > > >
> > > > IOW: I'm quite happy to make a best-effort attempt to meet that requirement, by making 'hard' mounts the default, and by making signalling a fatal operation. However, I'm unwilling to make it impossible to fix up my system when the server is unresponsive just because the protocol is lazy about providing for that ability.
> > > I'm just trying to think of options.
> > >
> > > Here's what's bothering me. Yes, my server list is limited, but neither the Linux nor the NetApp server implements argument caching for the replay cache, because of the performance hit. Perhaps they should, but the problem is that they currently don't, and customers probably deserve the best solution given the existing constraints. Do we have such a solution? Can you argue that the number of problems solved by the current solution is higher than by the other solution? With the current solution, we have (silent) data corruption and resource leakage on the server. I think that's a pretty big problem. What's worse: silent data corruption, or a hung client because it keeps sending the same ops and keeps getting SEQ_MISORDERED? (That's a rather big problem too, but workarounds exist to prevent data corruption/loss, like forcing the client to re-negotiate the session.)
> > >
> > > Until servers catch up with addressing false retries, what's the best solution the client can have?
> > If I knew of a better client-side solution, I would already have implemented it.
> > > The silent data corruption is in pnfs, because writes are allowed to leave an interrupted slot behind due to a timeout. Alternatively, I propose that we make pnfs writes the same as normal writes. That will remove the data corruption problem.
> > > It would still have the resource leakage, but that seems a bit better than data corruption.
> > How often does this problem actually occur in the field?
> >
> > The point is that if the interruption is due to a signal, then that signal is fatal to the application. That's not silent corruption; writes are expected not to complete in that kind of situation.
> The interruption is because the server isn't replying in time to a pnfs write. I believe we have a timeout of 10 secs (or whatever that value is), after which the write is sent to the MDS instead. But the slot to the DS is left as interrupted (or at least it used to be).
>
> Here's what needs to happen for the problem to occur. I don't know how likely it is:
>
> 1. The 1st DS write gets stuck in the network for longer than the NFS client is willing to wait.
> 2. The client interrupts the slot and re-sends the write to the MDS.

That's not what I expect from an upstream 5.5-rcX client. The following lines in rpc_check_timeout() should ensure the client does not interrupt unless the network is disconnected:

	/*
	 * Once a "no retrans timeout" soft tasks (a.k.a NFSv4) has
	 * been sent, it should time out only if the transport
	 * connection gets terminally broken.
	 */
	if ((task->tk_flags & RPC_TASK_NO_RETRANS_TIMEOUT) &&
	    rpc_check_connected(task->tk_rqstp))
		return;

> 3. Another write to the DS is issued, re-using the slot (seqid bumped), and this write gets to the server before the write from step 1. The server replies with ERR_SEQ_MISORDERED.
> 4. The server processes the 1st write.
> 5. The client retries the write from step 3 with a decremented seqid. It gets a reply saying the write completed, but in reality that write never wrote its data. Thus silent data corruption.

So try the following change: change the nfs4_init_sequence() line in nfs4_proc_write_setup() so that it reads

	nfs4_init_sequence(&hdr->args.seq_args, &hdr->res.seq_res, 0, 0);

There really is no good reason for us to be caching write replies when we know the client is already ordering NFS writes to the same range.

> > If the interruption is due to a soft mount timing out, then it is because the user deliberately chose that non-default setting, and should be aware of the consequences.
> >
> > Note that in newer kernels, these soft timeouts do not trigger unless the network connection is also down, so there is already a recovery process required to re-establish the network connection before any further requests can be sent.
> >
> > Furthermore, the application should also be seeing a POSIX error when fsync() is called, so again, there should be no silent corruption.
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx

-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space