On Mon, 2012-08-13 at 12:26 -0400, Trond Myklebust wrote:
> On Sun, 2012-08-12 at 20:36 +0300, Boaz Harrosh wrote:
> > On 08/09/2012 06:39 PM, Myklebust, Trond wrote:
> > > If the problem is that the DS is failing to respond, how does the client know that the in-flight I/O has ended?
> >
> > For the client, the above DS in question has timed out: we have reset its session and closed its sockets, and all of its RPC requests have been, or are being, ended with a timeout error. So the timed-out DS is a no-op. All of its I/O requests will end very soon, if not already.
> >
> > A DS timeout is just a very valid, and meaningful, response, just like an op-done-with-error. This was what Andy added to the RFC's errata, which I agree with.
> >
> > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a data server that is not responding. It isn't attempting to use the layout after the layoutreturn: the whole point is that we are attempting write-through-MDS after the attempt to write through the DS timed out.
> >
> > Trond STOP!!! this is pure bullshit. You guys took the opportunity of me being in hospital, and the rest of the bunch not having a clue, and snuck in a patch that is totally wrong for everyone, not taking care of any other LD *crashes*. And especially when this patch is wrong even for the files layout.
> >
> > This above here is where you are wrong!! You don't understand my point, and you ignore my comments. So let me state it as clearly as I can.
>
> YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout for an RPC call once it has started. This is why we need fencing _specifically_ for the pNFS files client.
>
> > (Let's assume the files layout; for blocks and objects it's a bit different, but mostly the same.)
>
> That, and the fact that fencing hasn't been implemented for blocks and objects. The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT for each file with failed DS connection I/O) and touches only fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects.
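To make the disputed error path concrete, here is roughly what that commit's behaviour amounts to, as a sketch only: the function and type names below are hypothetical stand-ins, not the code from 82c7c7a5a.

#include <errno.h>

/* Hypothetical stand-ins for the real files-layout and MDS I/O paths. */
struct pnfs_io;
int ds_write(struct pnfs_io *io);          /* write through the data server        */
int send_layoutreturn(struct pnfs_io *io); /* LAYOUTRETURN for this file's layout  */
int mds_write(struct pnfs_io *io);         /* ordinary NFS write through the MDS   */

static int filelayout_write_sketch(struct pnfs_io *io)
{
        int err = ds_write(io);

        if (err != -ETIMEDOUT && err != -ECONNREFUSED)
                return err;     /* success, or a failure unrelated to the DS connection */

        /*
         * The DS connection failed.  RPCs already on the wire cannot be
         * cancelled, so ask the MDS to fence the unresponsive DS by
         * returning the layout for this file, and only then redrive the
         * I/O through the MDS.
         */
        send_layoutreturn(io);
        return mds_write(io);
}

The point, again, is that the layoutreturn is there so that the MDS can fence the DS, because the client itself has no way to recall RPC calls that have already been sent.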
> > - Heavy I/O is going on, and the device_id in question has *3* DSs in its device topology, say DS1, DS2, DS3.
> >
> > - We have been queuing I/O, and all queues are full. (We have 3 queues in question, right? What is the maximum queue depth per files DS? I know that in blocks and objects we usually have, I think, something like 128. This is a *tunable* in the block layer's request queue. Is it not some negotiated parameter with the NFS servers?)
> >
> > - Now, boom, DS2 has timed out. The Linux client resets the session and internally closes all sockets of that session. All the RPCs that belong to DS2 are being returned up with a timeout error. This one is just the first of all those belonging to DS2; they will be decrementing the reference for this layout very, very soon.
> >
> > - But what about the DS1 and DS3 RPCs? What should we do with those? This is where you guys (Trond and Andy) are wrong. We must wait for these RPCs as well, and opposite to what you think, this should not take long. Let me explain:
> >
> >   We don't know anything about DS1 and DS3; each might either be having the same communication problem as DS2, or be working just fine. So let's say, for example, that DS3 will also time out in the future, and that DS1 is just fine and is writing as usual.
> >
> >   * DS1 - Since it's working, it has most probably already finished all of its I/O, because the NFS timeout is usually much longer than the normal RPC time, and since we are queuing evenly on all 3 DSs, at this point most probably all of DS1's RPCs are already done (and the layout has been de-referenced).
> >
> >   * DS3 - Will time out in the future; when will that be? So let me start by saying:
> >
> >     (1). We could enhance our code and proactively "cancel/abort" all RPCs that belong to DS3 (more on this below).
>
> Which makes the race _WORSE_. As I said above, there is no 'cancel RPC' operation in SunRPC. Once your RPC call is launched, it cannot be recalled. All your discussion above is about the client side, and ignores what may be happening on the data server side. The fencing is what is needed to deal with the data server picture.
>
> >     (2). Or we can prove that DS3's RPCs will time out at worst 1 x NFS-timeout after the DS2 timeout event above, or 2 x NFS-timeout after the queuing of the first timed-out RPC. And statistically, in the average case, DS3 will time out very near the time DS2 timed out.
> >
> >     This is easy, since the last I/O we queued was the one that made DS2's queue full, and it was kept full because DS2 stopped responding and nothing emptied the queue.
> >
> >     So the easiest thing we can do is wait for DS3 to time out, soon enough, and once that happens the session will be reset and all RPCs will end with an error.
>
> You are still only discussing the client side.
>
> Read my lips: Sun RPC OPERATIONS DO NOT TIMEOUT AND CANNOT BE ABORTED OR CANCELED. Fencing is the closest we can come to an abort operation.
>
> > So in the worst-case scenario we can recover 2 x NFS-timeout after a network partition, which is just 1 x NFS-timeout after your schizophrenic FENCE_ME_OFF, newly invented operation.
> >
> > What we can do to enhance our code to reduce error recovery to 1 x NFS-timeout:
> >
> > - DS3 above: (As I said, DS1's queues are now empty, because it was working fine, so DS3 is a representation of all DSs that have RPCs, belonging to this layout, at the time DS2 timed out.)
> >
> >   We can proactively abort all RPCs belonging to DS3. If there is a way to internally abort RPCs, use that. Else just reset its session, and all sockets will close (and reopen), and all RPCs will end with a disconnect error.
>
> Not on most servers that I'm aware of. If you close or reset the socket on the client, then the Linux server will happily continue to process those RPC calls; it just won't be able to send a reply.

One small correction here: _If_ we are using NFSv4.2, and _if_ the client requests the EXCHGID4_FLAG_SUPP_FENCE_OPS in the EXCHANGE_ID operation, and _if_ the data server replies that it supports that, and _if_ the client gets a successful reply to a DESTROY_SESSION call to the data server, _then_ it can know that all RPC calls have completed. However, we're not supporting NFSv4.2 yet.

> Furthermore, if the problem is that the data server isn't responding, then a socket close/reset tells you nothing either.

...and we still have no solution for this case.
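For clarity, the chain of conditions in that NFSv4.2 correction looks roughly like this. Illustrative only: the struct, the flag value shown and ds_destroy_session() are made-up stand-ins, not NFS client code.

#include <stdbool.h>

#define EXCHGID4_FLAG_SUPP_FENCE_OPS 0x00000004 /* value shown for illustration */

struct ds_conn {
        unsigned int flags_requested;   /* flags the client asked for in EXCHANGE_ID */
        unsigned int flags_granted;     /* flags the data server replied with        */
};

/* Hypothetical helper: returns 0 if a DESTROY_SESSION call to the data server succeeded. */
int ds_destroy_session(struct ds_conn *ds);

static bool all_ds_rpcs_completed(struct ds_conn *ds)
{
        if (!(ds->flags_requested & EXCHGID4_FLAG_SUPP_FENCE_OPS))
                return false;   /* the client never asked for fence ops */
        if (!(ds->flags_granted & EXCHGID4_FLAG_SUPP_FENCE_OPS))
                return false;   /* the data server did not grant them   */
        /* Only a successful DESTROY_SESSION reply proves every RPC has completed. */
        return ds_destroy_session(ds) == 0;
}

Until we support NFSv4.2, none of these conditions can be met, so the client cannot prove anything about RPCs the data server is still processing.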
> > - Both DS2, which timed out, and DS3, which was aborted, should be marked with a flag. When new I/O that belongs to some other inode, through some other layout+device_id, encounters a flagged device, it should abort and turn to MDS I/O, also invalidating its layout; hence, soon enough, the device_id for DS2 & DS3 will be de-referenced and removed from the device cache (and all referencing layouts are now gone).
>
> There is no RPC abort functionality in SunRPC. Again, this argument relies on functionality that _doesn't_ exist.
>
> > So we do not continue queuing new I/O to dead devices. And since the MDS will most probably not give us dead servers in a new layout, we should be good.
> >
> > In summary:
> >
> > - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. The client *must not* skb-send a single byte belonging to a layout after the send of LAYOUT_RETURN. (It need not wait for OPT_DONE from the DS to do that; it just must make sure that all its internal, or on-the-wire, requests are aborted, by simply closing the sockets they belong to, and/or by waiting for the healthy DSs' I/O to be OPT_DONE. So the client is not dependent on any DS response; it is only dependent on its internal state being *clean* of any more skb-send(s).)
>
> Ditto
>
> > - The proper implementation of LAYOUT_RETURN on error for fast turnover is not hard, and does not involve a newly invented NFS operation such as FENCE_ME_OFF. A properly coded client, independently, without the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround by actively returning all layouts that belong to a bad DS, rather than waiting for a fence-off of a single layout and then encountering just the same error with all the other layouts that have the same DS.
>
> What do you mean by "all layouts that belong to a bad DS"? Layouts don't belong to a DS, and so there is no way to get from a DS to a layout.
>
> > - And I know that, just as you did not read my emails from before I went to hospital, you will continue to not understand this one, or what I'm trying to explain, and will most probably ignore all of it. But please note one thing:
>
> I read them, but just as now, they continue to ignore the reality about timeouts: timeouts mean _nothing_ in an RPC failover situation. There is no RPC abort functionality that you can rely on other than fencing.
>
> > YOU have sabotaged the NFS 4.1 Linux client, which is now totally not standards compliant, and have introduced CRASHes. And for no good reason.
>
> See above.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com
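P.S. For reference, the "flag the device and turn to MDS I/O" scheme quoted above amounts to roughly the following. The names are purely illustrative (this is not what fs/nfs/nfs4filelayout.c does), and note that it still says nothing about RPCs that are already on the wire to the data server.

#include <stdbool.h>

/* Hypothetical types and helpers; not the real pNFS device-cache code. */
struct pnfs_io;

struct pnfs_deviceid_node {
        bool bad;       /* set when a DS behind this device_id has timed out */
        /* ... reference count, list of DS addresses, etc. ... */
};

int  ds_submit(struct pnfs_io *io);                          /* queue the I/O to a DS       */
void return_layout_and_redirect_to_mds(struct pnfs_io *io);  /* drop the layout, use the MDS */

/* Called for every new I/O whose layout resolves to this device_id. */
static int pnfs_submit_io_sketch(struct pnfs_deviceid_node *dev, struct pnfs_io *io)
{
        if (dev->bad) {
                /*
                 * A DS behind this device_id already timed out: do not queue
                 * more I/O to it.  Returning the layout drops its reference
                 * on the device_id, and this I/O goes through the MDS instead.
                 */
                return_layout_and_redirect_to_mds(io);
                return 0;
        }
        return ds_submit(io);
}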