Re: [PATCH] NFSv4.1: Remove a bogus BUG_ON() in nfs4_layoutreturn_done

On Mon, 2012-08-13 at 12:26 -0400, Trond Myklebust wrote:
> On Sun, 2012-08-12 at 20:36 +0300, Boaz Harrosh wrote:
> > On 08/09/2012 06:39 PM, Myklebust, Trond wrote:
> > > If the problem is that the DS is failing to respond, how does the client
> > > know that the in-flight I/O has ended?
> > 
> > For the client, the DS in question has timed out: we have reset
> > its session and closed its sockets, and all of its RPC requests have
> > been, or are being, ended with a timeout error. So the timed-out
> > DS is a no-op. All of its IO requests will end very soon, if not already.
> > 
> > A DS time-out is just a very valid and meaningful response, just like
> > an op-done-with-error. This is what Andy added to the RFC's errata,
> > and I agree with it.
> > 
> > > 
> > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a
> > > data server that is not responding. It isn't attempting to use the
> > > layout after the layoutreturn: 
> > 
> > > the whole point is that we are attempting
> > > write-through-MDS after the attempt to write through the DS timed out.
> > > 
> > 
> > Trond, STOP!!! This is pure bullshit. You guys took the opportunity of
> > me being in hospital, and of the rest of the bunch not having a clue,
> > and snuck in a patch that is totally wrong for everyone and does not
> > take care of the other LDs' *crashes*. And this patch is wrong even for
> > the files layout.
> > 
> > This, above, is where you are wrong!! You don't understand my point
> > and are ignoring my comments. So let me state it as clearly as I can.
> 
> YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout
> for an RPC call once it has started. This is why we need fencing
> _specifically_ for the pNFS files client.
> 
> > (Let's assume the files layout; for blocks and objects it is a bit
> >  different, but mostly the same.)
> 
> That, and the fact that fencing hasn't been implemented for blocks and
> objects. The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT
> for each file with failed DS connection I/O) and touches only
> fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects.
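
For concreteness, here is a small stand-alone sketch of the kind of error path
that commit adds for the files layout (return the layout, then redo the I/O
through the MDS). This is *not* the code from 82c7c7a5a; every type and helper
below is a made-up stand-in for the real fs/nfs/nfs4filelayout.c machinery,
stubbed out so that the flow compiles:

#include <stdbool.h>
#include <stdio.h>

struct pnfs_io {
	int deviceid;		/* DS group the I/O was sent to */
	int fh;			/* file whose layout is affected */
	bool ds_timed_out;	/* set when the DS connection gave up */
};

/* Hypothetical stand-ins for the real client operations. */
static void mark_deviceid_unavailable(int id)
{
	printf("deviceid %d marked unavailable\n", id);
}

static void send_layoutreturn(int fh)
{
	printf("LAYOUTRETURN sent for fh %d\n", fh);
}

static void resend_through_mds(struct pnfs_io *io)
{
	printf("I/O for fh %d resent through the MDS\n", io->fh);
}

/*
 * Error path after a DS connection failure: the client does not try to
 * abort the in-flight DS RPCs (SunRPC has no such operation); it returns
 * the layout so that the MDS can fence the DS, then redoes the I/O
 * through the MDS.
 */
static void handle_ds_error(struct pnfs_io *io)
{
	if (!io->ds_timed_out)
		return;
	mark_deviceid_unavailable(io->deviceid);
	send_layoutreturn(io->fh);
	resend_through_mds(io);
}

int main(void)
{
	struct pnfs_io io = { .deviceid = 2, .fh = 17, .ds_timed_out = true };

	handle_ds_error(&io);
	return 0;
}
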
> 
> > - Heavy IO is going on, and the device_id in question has *3* DSs in its
> >   device topology: say DS1, DS2, DS3.
> > 
> > - We have been queuing IO, and all queues are full. (We have 3 queues
> >   in question, right? What is the maximum queue depth per files DS? I know
> >   that in blocks and objects we usually have, I think, something like 128.
> >   This is a *tunable* in the block layer's request queue. Is it not some
> >   negotiated parameter with the NFS servers?)
> > 
> > - Now, boom, DS2 has timed out. The Linux client resets the session and
> >   internally closes all sockets of that session. All the RPCs that
> >   belong to DS2 are being returned up with a timeout error. This one
> >   is just the first of all those belonging to DS2; they will all
> >   be decrementing the reference on this layout very, very soon.
> > 
> > - But what about the DS1 and DS3 RPCs? What should we do with those?
> >   This is where you guys (Trond and Andy) are wrong. We must wait
> >   for these RPCs as well, and contrary to what you think, this
> >   should not take long. Let me explain:
> > 
> >   We don't know anything about DS1 and DS3; each might either be
> >   "having the same communication problem as DS2" or be "just working
> >   fine". So let's say, for example, that DS3 will also time out in the
> >   future, and that DS1 is just fine and is writing as usual.
> > 
> >   * DS1 - Since it is working, it has most probably already finished
> >     all of its IO, because the NFS timeout is usually much longer
> >     than the normal RPC time, and since we are queuing evenly on
> >     all 3 DSs, at this point most probably all of DS1's RPCs are
> >     already done (and the layout has been de-referenced).
> > 
> >   * DS3 - Will time out in the future; when will that be?
> >     So let me start by saying:
> >     (1). We could enhance our code and proactively
> >          "cancel/abort" all RPCs that belong to DS3 (more on this
> >          below).
> 
> Which makes the race _WORSE_. As I said above, there is no 'cancel RPC'
> operation in SUNRPC. Once your RPC call is launched, it cannot be
> recalled. All your discussion above is about the client side, and
> ignores what may be happening on the data server side. The fencing is
> what is needed to deal with the data server picture.
> 
> >     (2). Or we can prove that DS3's RPCs will time out, at worst,
> >          1 x NFS-timeout after the above DS2 timeout event, or,
> >          equivalently, 2 x NFS-timeout after the queuing of the first
> >          timed-out RPC. And statistically, in the average case, DS3
> >          will time out very near the time DS2 timed out.
> > 
> >          This is easy, since the last IO we queued was the one that
> >          made DS2's queue full, and it was kept full because
> >          DS2 stopped responding and nothing emptied the queue.
> > 
> >      So the easiest thing we can do is wait for DS3 to time out, soon
> >      enough, and once that happens, the session will be reset and all
> >      RPCs will end with an error.
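
Restating the worst-case bound claimed above in concrete numbers, assuming a
purely hypothetical 60-second NFS timeout (the real value is a mount option)
and taking t0 as the queuing time of the first RPC that eventually times out
on DS2:

#include <stdio.h>

int main(void)
{
	const int nfs_timeout = 60;	/* hypothetical timeout, in seconds */

	/* DS2 is declared dead one timeout after that first RPC was queued. */
	int ds2_timeout_event = 1 * nfs_timeout;

	/*
	 * In the claimed worst case, every outstanding DS3 RPC has ended
	 * (by reply or by timeout) within one further timeout, i.e. by
	 * 2 x NFS-timeout after t0.
	 */
	int ds3_all_resolved = 2 * nfs_timeout;

	printf("DS2 timeout event:      t0 + %d s\n", ds2_timeout_event);
	printf("all DS3 RPCs resolved:  t0 + %d s (worst case)\n",
	       ds3_all_resolved);
	return 0;
}
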
> 
> 
> You are still only discussing the client side.
> 
> Read my lips: SunRPC OPERATIONS DO NOT TIME OUT AND CANNOT BE ABORTED OR
> CANCELED. Fencing is the closest we can come to an abort operation.
> 
> > So in the worst-case scenario we can recover 2 x NFS-timeout after
> > a network partition, which is just 1 x NFS-timeout later than with your
> > schizophrenic, newly invented FENCE_ME_OFF operation.
> > 
> > What we can do to enhance our code to reduce error recovery to
> > 1 x NFS-timeout:
> > 
> > - DS3 above:
> >   (As I said, DS1's queues are now empty because it was working fine,
> >    so DS3 represents all the DSs that had RPCs in flight at the
> >    time DS2 timed out and that belong to this layout.)
> > 
> >   We can proactively abort all RPCs belonging to DS3. If there is
> >   a way to internally abort RPCs, use that. Otherwise, just reset its
> >   session and all sockets will close (and reopen), and all RPCs
> >   will end with a disconnect error.
> 
> Not on most servers that I'm aware of. If you close or reset the socket
> on the client, then the Linux server will happily continue to process
> those RPC calls; it just won't be able to send a reply.

One small correction here:
_If_ we are using NFSv4.2, and _if_ the client requests the
EXCHGID4_FLAG_SUPP_FENCE_OPS in the EXCHANGE_ID operation, and _if_ the
data server replies that it supports that, and _if_ the client gets a
successful reply to a DESTROY_SESSION call to the data server, _then_ it
can know that all RPC calls have completed.

However, we're not supporting NFSv4.2 yet.
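
A minimal sketch of that chain of conditions. Every helper below
(exchange_id_to_ds, destroy_session_on_ds) is a hypothetical stand-in for the
real EXCHANGE_ID and DESTROY_SESSION calls, the flag name (and, if memory
serves, its value) comes from the NFSv4.2 spec, and the sketch glosses over
the client also having to request the flag in the first place:

#include <stdbool.h>
#include <stdio.h>

#define EXCHGID4_FLAG_SUPP_FENCE_OPS	0x00000004

/* Hypothetical stand-ins for the RPCs sent to the data server. */
static unsigned int exchange_id_to_ds(void)
{
	return EXCHGID4_FLAG_SUPP_FENCE_OPS;	/* pretend the DS supports it */
}

static bool destroy_session_on_ds(void)
{
	return true;				/* pretend the call succeeded */
}

/*
 * Returns true only when the client may conclude that every RPC it sent
 * to the data server has completed, one way or another, on that server.
 */
static bool all_ds_rpcs_known_complete(void)
{
	unsigned int flags = exchange_id_to_ds();

	if (!(flags & EXCHGID4_FLAG_SUPP_FENCE_OPS))
		return false;			/* DS does not support fence ops */

	return destroy_session_on_ds();		/* success => all RPCs are done */
}

int main(void)
{
	printf("fence-ops guarantee available: %s\n",
	       all_ds_rpcs_known_complete() ? "yes" : "no");
	return 0;
}
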

> Furthermore, if the problem is that the data server isn't responding,
> then a socket close/reset tells you nothing either.

...and we still have no solution for this case.

> > - Both DS2, which timed out, and DS3, which was aborted, should be
> >   marked with a flag. When new IO that belongs to some other
> >   inode, through some other layout+device_id, encounters a flagged
> >   device, it should abort and turn to MDS IO, also invalidating
> >   its layout; hence, soon enough, the device_id for DS2 & DS3 will be
> >   de-referenced and removed from the device cache. (And all referencing
> >   layouts are then gone.)
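
A rough sketch of what that flagging might look like, under entirely invented
names (none of this is existing client code; the failed-device table, the
layout invalidation and the MDS fallback are all hypothetical stand-ins):

#include <stdbool.h>
#include <stdio.h>

#define MAX_DEVICEIDS	16

/* Set when a DS in this device timed out or had its RPCs aborted. */
static bool deviceid_failed[MAX_DEVICEIDS];

/* Hypothetical stand-ins for invalidating a layout and doing MDS I/O. */
static void invalidate_layout(int inode)
{
	printf("layout for inode %d invalidated\n", inode);
}

static void write_through_mds(int inode)
{
	printf("inode %d written through the MDS\n", inode);
}

static void write_through_ds(int inode, int deviceid)
{
	printf("inode %d written through deviceid %d\n", inode, deviceid);
}

/*
 * New IO, possibly for a different inode and layout, that maps to a
 * flagged device is not queued to the dead DS: the layout is invalidated
 * and the IO goes through the MDS, so the bad deviceid soon loses all
 * of its references and drops out of the device cache.
 */
static void issue_write(int inode, int deviceid)
{
	if (deviceid_failed[deviceid]) {
		invalidate_layout(inode);
		write_through_mds(inode);
		return;
	}
	write_through_ds(inode, deviceid);
}

int main(void)
{
	deviceid_failed[2] = true;	/* DS2's device timed out earlier */
	issue_write(40, 2);		/* falls back to the MDS */
	issue_write(41, 1);		/* healthy device, normal pNFS path */
	return 0;
}
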
> 
> There is no RPC abort functionality in SunRPC. Again, this argument relies
> on functionality that _doesn't_ exist.
> 
> >   So we do not continue queuing new IO to dead devices. And since the
> >   MDS will most probably not give us dead servers in new layouts, we
> >   should be good.
> > 
> > In summary:
> > - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. The
> >   client *must not* skb-send a single byte belonging to a layout after
> >   the send of LAYOUT_RETURN.
> >   (It need not wait for OPT_DONE from the DS to do that; it just must
> >    make sure that all of its internal, or on-the-wire, requests are
> >    aborted by simply closing the sockets they belong to, and/or by
> >    waiting for the healthy DSs' IO to be OPT_DONE. So the client is not
> >    dependent on any DS response; it is only dependent on its internal
> >    state being *clean* of any further skb-send(s).)
> 
> Ditto
> 
> > - The proper implementation of LAYOUT_RETURN on error for fast turnover
> >   is not hard, and does not involve a newly invented NFS operation such
> >   as FENCE_ME_OFF. A properly coded client, independently, without
> >   the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround
> >   by actively returning all layouts that belong to a bad DS, rather than
> >   waiting for a fence-off of a single layout and then encountering just
> >   the same error with all the other layouts that use the same DS.
> 
> What do you mean by "all layouts that belong to a bad DS"? Layouts don't
> belong to a DS, and so there is no way to get from a DS to a layout.
> 
> > - And I know that, just as you did not read my emails from before
> >   I went to hospital, you will continue not to understand this
> >   one, or what I'm trying to explain, and will most probably ignore
> >   all of it. But please note one thing:
> 
> I read them, but just as now, they continue to ignore the reality about
> timeouts: timeouts mean _nothing_ in an RPC failover situation. There is
> no RPC abort functionality that you can rely on other than fencing.
> 
> >     YOU have sabotaged the NFS 4.1 Linux client, which is now totally
> >     not standards compliant, and have introduced CRASHes. And for no
> >     good reason.
> 
> See above.
> 

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
