On Mon, 2019-03-11 at 11:14 -0400, Olga Kornievskaia wrote: > On Mon, Mar 11, 2019 at 11:12 AM Trond Myklebust > <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Mon, 2019-03-11 at 14:30 +0000, Trond Myklebust wrote: > > > Hi Olga, > > > > > > On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote: > > > > There are a bunch of cases where multiple operations are using > > > > the > > > > same seqid and slot. > > > > > > > > Example of such weirdness is (nfs.seqid == 0x000002f4) && > > > > (nfs.slotid > > > > == 0) and the one leading to the hang. > > > > > > > > In frame 415870, there is an OPEN using that seqid and slot for > > > > the > > > > first time (but this slot will be re-used a bunch of times > > > > before > > > > it > > > > gets a reply in frame 415908 with the open stateid > > > > seq=40). (also > > > > in > > > > this packet there is an example of reuse > > > > slot=1+seqid=0x000128f7 by > > > > both TEST_STATEID and OPEN but let's set that aside). > > > > > > > > In frame 415874 (in the same packet), client sends 5 opens on > > > > the > > > > SAME > > > > seqid and slot (all have distinct xids). In a ways that's end > > > > up > > > > being > > > > alright since opens are for the same file and thus reply out of > > > > the > > > > cache and the reply is ERR_DELAY. But in frame 415876, client > > > > sends > > > > again uses the same seqid and slot and in this case it's used > > > > by > > > > 3opens and a test_stateid. > > > > This should result in exactly 1 bump of the stateid seqid. > > > > > > Client in all this mess never processes the open stateid seq=40 > > > > and > > > > keeps on resending CLOSE with seq=37 (also to note client > > > > "missed" > > > > processing seqid=38 and 39 as well. 39 probably because it was > > > > a > > > > reply > > > > on the same kind of "Reused" slot=1 and seq=0x000128f7. I > > > > haven't > > > > tracked 38 but i'm assuming it's the same). I don't know how > > > > many > > > > times but after 5mins, I see a TEST_STATEID that again uses the > > > > same > > > > seqid+slot (which gets a reply from the cache matching OPEN). > > > > Also > > > > open + close (still with seq=37) open is replied to but after > > > > this > > > > client goes into a soft lockup logs have > > > > "nfs4_schedule_state_manager: > > > > kthread_ruan: -4" over and over again . then a soft lockup. > > > > > > > > Looking back on slot 0. nfs.seqid=0x000002f3 was used in > > > > frame=415866 > > > > by the TEST_STATEID. This is replied to in frame 415877 (with > > > > an > > > > ERR_DELAY). But before the client got a reply, it used the slot > > > > and > > > > the seq by frame 415874. TEST_STATEID is a synchronous and > > > > interruptible operation. I'm suspecting that somehow it was > > > > interrupted and that's who the slot was able to be re-used by > > > > the > > > > frame 415874. But how the several opens were able to get the > > > > same > > > > slot > > > > I don't know.. > > > > > > Is this still true with the current linux-next? I would expect > > > this > > > patch > > > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971 > > > to change the Linux client behaviour in the above regard. > > > > > > > Note also that what you describe would appear to indicate a serious > > server bug. If the client is reusing a slot+seqid, then the correct > > server behaviour is to either return one of the errors > > NFS4ERR_DELAY, > > NFS4ERR_SEQ_FALSE_RETRY, NFS4ERR_RETRY_UNCACHED_REP, > > NFS4ERR_SEQ_MISORDERED, or else it must replay the exact same reply > > that it had cached for the original request. > > > > It is protocol violation for the server to execute new requests on > > a > > slot+seqid combination that has already been used. For that reason, > > it > > is also impossible for a replay to cause further state changes on > > the > > server; a replay means that the server belts out the exact same > > reply > > that was attempted sent earlier with no changes (with stateids that > > match bit for bit what would have been sent earlier). > > > > But it is the same requests because all of them are opens on the same > file same everything. That is irrelevant. The whole point of the session slot mechanism is to provide reliable only once semantics as defined in section 2.10.6 of RFC5661: https://tools.ietf.org/html/rfc5661#section-2.10.6 -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx