On Tue, 2012-08-14 at 02:39 +0300, Boaz Harrosh wrote:
> On 08/13/2012 07:26 PM, Myklebust, Trond wrote:
>
> >> This above here is where you are wrong!! You don't understand my point,
> >> and ignore my comments. So let me state it as clear as I can.
> >
> > YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout
> > for an RPC call once it has started. This is why we need fencing
> > _specifically_ for the pNFS files client.
> >
>
> Again we have a communication problem between us. I say some words and
> mean one thing, and you say and hear the same words but attach different
> meanings to them. This is no one's fault, it just is.
>
> Let's do an experiment: mount a regular NFS4 in -o soft mode and start
> writing to server, say with dd. Now disconnect the cable. After some timeout
> the dd will return with "IO error", and will stop writing to file.
>
> This is the timeout I mean. Surely some RPC-requests did not complete and
> returned to NFS core with some kind of error.
>
> With RPC-requests I do not mean the RPC protocol on the wire, I mean that
> entity inside the Linux Kernel which represents an RPC. Surely some
> linux-RPC-requests objects were not released due to a server "rpc-done"
> received, but due to an internal mechanism that called the "release" method due to
> a communication timeout.
>
> So this is what I call "returned with a timeout". It does exist and is used
> every day.
>
> Even better if I don't disconnect the wire but do an if_down or halt on the
> server, the dd's IO error will happen immediately, not even wait for
> any timeout. This is because the socket is orderly closed and all
> sends/receives will return quickly with a "disconnect-error".
>
> When I use a single Server like the nfs4 above, then there is one fact
> in above scenario that I want to point out:
>
> At some point in the NFS-Core state.
> There is a point that no more
> requests are issued, all old requests have been released, and an error is
> returned to the application. At that point the client will not call
> skb-send, and will not try further communication with the Server.
>
> This is what must happen with ALL DSs that belong to a layout, before
> client should be LAYOUT_RETURN(ing). The client can only do its job. That is:
>
> STOP any skb-send, to any of the DSs in a layout.
> Only then it is complying with the RFC.
>
> So this is what I mean by "return with a timeout below"

I hear you, now listen to me.

Who _cares_ if the client sends an RPC call after the layoutreturn? In
the case of an unresponsive data server the client can't guarantee that
this won't happen even if it does wait.

A pNFS files server that doesn't propagate the layoutreturn to the data
server in a timely fashion is fundamentally _broken_ in the case where
the communication between the data server and client is down. It cannot
offer any data integrity guarantees when the client tries to write
through MDS, because the DSes may still be processing old write RPC
calls.

> >> (Let's assume files layout, for blocks and objects it's a bit different
> >> but mostly the same.)
> >
> > That, and the fact that fencing hasn't been implemented for blocks and
> > objects.
> >
> That's not true. Both at Panasas and at EMC there is fencing in place and
> it is used every day. This is why I insist that it is very much
> the same for all of us.

I'm talking about the use of layoutreturn for client fencing, which is
only implemented for files.

However Tao admitted that the blocks client has not yet implemented the
timed-lease fencing as described in RFC5663, so there is still work to
be done there.

I've no idea what the object client is doing.

> > The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT
> > for each file with failed DS connection I/O) and touches only
> > fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects.
> >
>
> OK I had in mind the patches that Andy sent. I'll look again for what
> actually went in. (It was all while I was unavailable)

He sent a revised patch set, which should only affect the files layout.

> >> - Heavy IO is going on, the device_id in question has *3* DSs in its
> >>   device topography. Say DS1, DS2, DS3
> >>
> >> - We have been queuing IO, and all queues are full. (we have 3 queues in
> >>   question, right? What is the maximum Q depth per files-DS? I know
> >>   that in blocks and objects we usually have, I think, something like 128.
> >>   This is a *tunable* in the block-layer's request-queue. Is it not some
> >>   negotiated parameter with the NFS servers?)
> >>
> >> - Now, boom DS2 has timed-out. The Linux-client resets the session and
> >>   internally closes all sockets of that session. All the RPCs that
> >>   belong to DS2 are being returned up with a timeout error. This one
> >>   is just the first of all those belonging to this DS2. They will
> >>   be decrementing the reference for this layout very, very soon.
> >>
> >> - But what about DS1, and DS3 RPCs. What should we do with those?
> >>   This is where you guys (Trond and Andy) are wrong. We must also
> >>   wait for these RPC's as well. And opposite to what you think, this
> >>   should not take long. Let me explain:
> >>
> >>   We don't know anything about DS1 and DS3, each might be, either,
> >>   "Having the same communication problem, like DS2". Or "is just working
> >>   fine". So let's say for example that DS3 will also time-out in the
> >>   future, and that DS1 is just fine and is writing as usual.
> >>
> >>   * DS1 - Since it's working, it is most probably already done
> >>     with all its IO, because the NFS timeout is usually much longer
> >>     than the normal RPC time, and since we are queuing evenly on
> >>     all 3 DSs, at this point most probably, all of DS1 RPCs are
> >>     already done. (And layout has been de-referenced).
> >>
> >>   * DS3 - Will timeout in the future, when will that be?
> >>     So let me start with, saying:
> >>     (1). We could enhance our code and proactively,
> >>     "cancel/abort" all RPCs that belong to DS3 (more on this
> >>     below)
> >
> > Which makes the race _WORSE_. As I said above, there is no 'cancel RPC'
> > operation in SUNRPC. Once your RPC call is launched, it cannot be
> > recalled. All your discussion above is about the client side, and
> > ignores what may be happening on the data server side. The fencing is
> > what is needed to deal with the data server picture.
> >
>
> Again, some misunderstanding. I never said we should not send
> a LAYOUT_RETURN before writing through MDS. The opposite is true,
> I think it is a novel idea and gives you the kind of barrier that
> will harden and robust the system.
>
> WHAT I'm saying is that this cannot happen while the schizophrenic
> client is busily still actively skb-sending more and more bytes
> to all the other DSs in the layout. LONG AFTER THE LAYOUT_RETURN
> HAS BEEN SENT AND RESPONDED.
>
> So what you are saying does not at all contradict what I want.
>
> "The fencing is what is needed to deal with the data server picture"
>
> Fine, but ONLY after the client has really stopped all sends.
> (Each one will do its job)
>
> BTW: The Server does not *need* the Client to send a LAYOUT_RETURN.
> It's just a nice-to-have, which I'm fine with.
> Both Panasas and EMC when IO is sent through MDS, will first
> recall, overlapping layouts, and only then proceed with
> MDS processing. (This is some deeply rooted mechanism inside
> the FS, an MDS being just another client)

Again, we're not talking about blocks or objects.

> So this is a known problem that is taken care of. But I totally
> agree with you, the client LAYOUT_RETURN(ing) the layout will save
> lots of protocol time by avoiding the recalls.
> Now you understand why in Objects we mandated this LAYOUT_RETURN
> on errors. And while at it we want the exact error reported.
>
> >>     (2).
> >>     Or we can prove that DS3's RPCs will timeout at worst
> >>     case 1 x NFS-timeout after above DS2 timeout event, or
> >>     2 x NFS-timeout after the queuing of the first timed-out
> >>     RPC. And statistically in the average case DS3 will timeout
> >>     very near the time DS2 timed-out.
> >>
> >>     This is easy since the last IO we queued was the one that
> >>     made DS2's queue to be full, and it was kept full because
> >>     DS2 stopped responding and nothing emptied the queue.
> >>
> >>     So the easiest we can do is wait for DS3 to timeout, soon
> >>     enough, and once that will happen, session will be reset and all
> >>     RPCs will end with an error.
> >
> >
> > You are still only discussing the client side.
> >
> > Read my lips: Sun RPC OPERATIONS DO NOT TIMEOUT AND CANNOT BE ABORTED OR
> > CANCELED. Fencing is the closest we can come to an abort operation.
> >
>
> Again I did not mean the "Sun RPC OPERATIONS" on the wire. I meant
> the Linux-request-entity which, while it exists, has a potential to be
> submitted for skb-send. As seen above these entities do timeout
> in "-o soft" mode and once released remove the potential of any more
> future skb-sends on the wire.
>
> BUT what I do not understand is: In above example we are talking
> about DS3. We assumed that DS3 has a communication problem. So no amount
> of "fencing" or voodoo or any other kind of operation can ever affect
> the client regarding DS3. Because even if on-the-server pending requests
> from client on DS3 are fenced and discarded, these errors will not
> be communicated back to the client. The client will sit idle on DS3
> communication until the end of the timeout, regardless.

We don't care about receiving any errors. We've timed out. All we
want to do is to fence off the damned writes that have already been sent
to the borken DS and then resend them through the MDS.
> Actually what I propose for DS3 in the best robust client is to
> destroy its DS3 sessions and therefore cause all Linux-request-entities
> to return much much faster than if *just waiting* for the timeout to expire.

I repeat: destroying the session on the client does NOTHING to help you.

> >> So in the worst case scenario we can recover 2 x NFS-timeout after
> >> a network partition, which is just 1 x NFS-timeout, after your
> >> schizophrenic FENCE_ME_OFF, newly invented operation.
> >>
> >> What we can do to enhance our code to reduce error recovery to
> >> 1 x NFS-timeout:
> >>
> >> - DS3 above:
> >>   (As I said DS1's queues are now empty, because it was working fine,
> >>   so DS3 is a representation of all DS's that have RPCs at the
> >>   time DS2 timed-out, which belong to this layout)
> >>
> >>   We can proactively abort all RPCs belonging to DS3. If there is
> >>   a way to internally abort RPC's use that. Else just reset its
> >>   session and all sockets will close (and reopen), and all RPC's
> >>   will end with a disconnect error.
> >
> > Not on most servers that I'm aware of. If you close or reset the socket
> > on the client, then the Linux server will happily continue to process
> > those RPC calls; it just won't be able to send a reply.
> > Furthermore, if the problem is that the data server isn't responding,
> > then a socket close/reset tells you nothing either.
> >
>
> Again I'm talking about the NFS-Internal-request-entities; these will
> be released, thus guaranteeing that no more threads will use any of
> these to send any more bytes over to any DSs.
>
> AND yes, yes.
> Once the client has done its job and stopped any future
> skb-sends to *all* DSs in question, only then it can report to MDS:
> "Hey I'm done sending in all other routes here LAYOUT_RETURN"
> (Now fencing happens on servers)
>
> and client goes on and says
>
> "Hey can you MDS please also write this data"
>
> Which is perfect for MDS because otherwise if it wants to make sure,
> it will need to recall all outstanding layouts, exactly for your
> reason, for concern for the data corruption that can happen.

So it recalls the layouts, and then... it _still_ has to fence off any
writes that are in progress on the broken DS.

All you've done is add a recall to the whole process. Why?

> >> - Both DS2 that timed-out, and DS3 that was aborted, should be
> >>   marked with a flag. When new IO that belongs to some other
> >>   inode through some other layout+device_id encounters a flagged
> >>   device, it should abort and turn to MDS IO, with also invalidating
> >>   its layout, and hence, soon enough the device_id for DS2&3 will be
> >>   de-referenced and be removed from device cache. (And all referencing
> >>   layouts are now gone)
> >
> > There is no RPC abort functionality in Sun RPC. Again, this argument relies
> > on functionality that _doesn't_ exist.
> >
>
> Again I mean internally at the client. For example closing the socket will
> have the effect I want. (And some other tricks we can talk about those
> later, let's agree about the principle first)

Timing out will prevent the damned client from sending more data. So
what?

> >>   So we do not continue queuing new IO to dead devices. And since most
> >>   probably MDS will not give us dead servers in new layout, we should be
> >>   good.
> >> In summary:
> >> - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client
> >>   *must not* skb-send a single byte belonging to a layout, after the send
> >>   of LAYOUT_RETURN.
> >>   (It need not wait for OPT_DONE from DS to do that, it just must make
> >>   sure, that all its internal, or on-the-wire requests, are aborted
> >>   by easily closing the sockets they belong to, and/or waiting for
> >>   healthy DS's IO to be OPT_DONE. So the client is not dependent on
> >>   any DS response, it is only dependent on its internal state being
> >>   *clean* from any more skb-send(s))
> >
> > Ditto
> >
> >> - The proper implementation of LAYOUT_RETURN on error for fast turnover
> >>   is not hard, and does not involve a new invented NFS operation such
> >>   as FENCE_ME_OFF. A properly coded client, independently, without
> >>   the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround
> >>   by actively returning all layouts that belong to a bad DS, and not
> >>   waiting for a fence-off of a single layout, then encountering just
> >>   the same error with all other layouts that have the same DS
> >
> > What do you mean by "all layouts that belong to a bad DS"? Layouts don't
> > belong to a DS, and so there is no way to get from a DS to a layout.
> >
>
> Why, sure. Loop on all layouts and ask if it has a specific DS.

> >> - And I know that just as you did not read my emails from before
> >>   me going to Hospital, you will continue to not understand this
> >>   one, or what I'm trying to explain, and will most probably ignore
> >>   all of it. But please note one thing:
> >
> > I read them, but just as now, they continue to ignore the reality about
> > timeouts: timeouts mean _nothing_ in an RPC failover situation. There is
> > no RPC abort functionality that you can rely on other than fencing.
> >
>
> I hope I explained this by now. If not please, please, let's organize
> a phone call. We can use Panasas conference number whenever you are
> available. I think we communicate better in person.
>
> Everyone else is also invited.
>
> BUT there is one most important point for me:
>
> As stated by the RFC.
> Client must guarantee that no more bytes will be
> sent to any DSs in a layout, once LAYOUT_RETURN is sent. This is the
> only definition of LAYOUT_RETURN, and NO_MATCHING_LAYOUT as response
> to a LAYOUT_RECALL. Which is:
> Client has indicated no more future sends on a layout. (And server will
> enforce it with a fencing)

The client can't guarantee that. The protocol offers no way for it to do
so, no matter what the pNFS text may choose to say.

> >> YOU have sabotaged the NFS 4.1 Linux client, which is now totally
> >> not STD compliant, and have introduced CRASHes. And for no good
> >> reason.
> >
> > See above.
> >
>
> OK We'll have to see about these crashes, let's talk about them.
>
> Thanks
> Boaz

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@xxxxxxxxxx
www.netapp.com