On Mon, Mar 11, 2019 at 10:30 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > Hi Olga, > > On Sun, 2019-03-10 at 18:20 -0400, Olga Kornievskaia wrote: > > There are a bunch of cases where multiple operations are using the > > same seqid and slot. > > > > Example of such weirdness is (nfs.seqid == 0x000002f4) && (nfs.slotid > > == 0) and the one leading to the hang. > > > > In frame 415870, there is an OPEN using that seqid and slot for the > > first time (but this slot will be re-used a bunch of times before it > > gets a reply in frame 415908 with the open stateid seq=40). (also in > > this packet there is an example of reuse slot=1+seqid=0x000128f7 by > > both TEST_STATEID and OPEN but let's set that aside). > > > > In frame 415874 (in the same packet), client sends 5 opens on the > > SAME > > seqid and slot (all have distinct xids). In a ways that's end up > > being > > alright since opens are for the same file and thus reply out of the > > cache and the reply is ERR_DELAY. But in frame 415876, client sends > > again uses the same seqid and slot and in this case it's used by > > 3opens and a test_stateid. > > > > Client in all this mess never processes the open stateid seq=40 and > > keeps on resending CLOSE with seq=37 (also to note client "missed" > > processing seqid=38 and 39 as well. 39 probably because it was a > > reply > > on the same kind of "Reused" slot=1 and seq=0x000128f7. I haven't > > tracked 38 but i'm assuming it's the same). I don't know how many > > times but after 5mins, I see a TEST_STATEID that again uses the same > > seqid+slot (which gets a reply from the cache matching OPEN). Also > > open + close (still with seq=37) open is replied to but after this > > client goes into a soft lockup logs have > > "nfs4_schedule_state_manager: > > kthread_ruan: -4" over and over again . then a soft lockup. > > > > Looking back on slot 0. nfs.seqid=0x000002f3 was used in frame=415866 > > by the TEST_STATEID. This is replied to in frame 415877 (with an > > ERR_DELAY). But before the client got a reply, it used the slot and > > the seq by frame 415874. TEST_STATEID is a synchronous and > > interruptible operation. I'm suspecting that somehow it was > > interrupted and that's who the slot was able to be re-used by the > > frame 415874. But how the several opens were able to get the same > > slot > > I don't know.. > > Is this still true with the current linux-next? I would expect this > patch > http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=commitdiff;h=3453d5708b33efe76f40eca1c0ed60923094b971 > to change the Linux client behaviour in the above regard. Yes. I reproduced it against your origin/testing (5.0.0-rc7+) commit 0d1bf3407c4ae88 ("SUNRPC: allow dynamic allocation of back channel slots"). > > Cheers > Trond > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx > >