> On Dec 31, 2018, at 1:59 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Mon, 2018-12-31 at 13:44 -0500, Chuck Lever wrote: >>> On Dec 31, 2018, at 1:09 PM, Trond Myklebust < >>> trondmy@xxxxxxxxxxxxxxx> wrote: >>> >>> On Thu, 2018-12-27 at 17:34 -0500, Chuck Lever wrote: >>>>> On Dec 27, 2018, at 5:14 PM, Trond Myklebust < >>>>> trondmy@xxxxxxxxxxxxxxx> wrote: >>>>> >>>>> >>>>> >>>>>> On Dec 27, 2018, at 20:21, Chuck Lever < >>>>>> chuck.lever@xxxxxxxxxx> >>>>>> wrote: >>>>>> >>>>>> Hi Trond- >>>>>> >>>>>> I've chased down a couple of remaining regressions with the >>>>>> v4.20 >>>>>> NFS client, >>>>>> and they seem to be rooted in this commit. >>>>>> >>>>>> When using sec=krb5, krb5i, or krb5p I found that multi- >>>>>> threaded >>>>>> workloads >>>>>> trigger a lot of server-side disconnects. This is with TCP >>>>>> and >>>>>> RDMA transports. >>>>>> An instrumented server shows that the client is under-running >>>>>> the >>>>>> GSS sequence >>>>>> number window. I monitored the order in which GSS sequence >>>>>> numbers appear on >>>>>> the wire, and after this commit, the sequence numbers are >>>>>> wildly >>>>>> misordered. >>>>>> If I revert the hunk in xprt_request_enqueue_transmit, the >>>>>> problem goes away. >>>>>> >>>>>> I also found that reverting that hunk results in a 3-4% >>>>>> improvement in fio >>>>>> IOPS rates, as well as improvement in average and maximum >>>>>> latency >>>>>> as reported >>>>>> by fio. >>>>>> >>>>> >>>>> Hmm… Provided the sequence numbers still lie within the window, >>>>> then why would the order matter? >>>> >>>> The misordering is so bad that one request is delayed long enough >>>> to >>>> fall outside the window. The new “need re-encode” logic does not >>>> trigger. >>>> >>> >>> That's weird. I can't see anything wrong with need re-encode at >>> this >>> point. >> >> I don't think there is anything wrong with it, it looks like it's >> not called in this case. > > So you are saying that the call to rpcauth_xmit_need_reencode() is > triggering the EBADMSG, but that this fails to cause a re-encode of the > message? No, I think what's going on is that the need_reencode happens when the RPC is enqueued, and is successful. But xprt_request_enqueue_transmit places the RPC somewhere in the middle of xmit_queue. xmit_queue is long enough that more than 128 requests are before the enqueued request. >>> Do the window sizes agree on the client and the server? >> >> Yes, both are 128. I also tried with 64 on the client side and 128 >> on the server side. That reduces the frequency of disconnects, but >> does not eliminate them. >> >> I'm not clear what problem the logic in xprt_request_enqueue_transmit >> is trying to address. It seems to me that the initial, simple >> implementation of this function is entirely adequate..? > > I agree that the fair queueing code could result in a reordering that > could screw up the RPCSEC_GSS sequencing. However, we do expect the > need reencode stuff to catch that. -- Chuck Lever