Hi Chuck,

On Mon, 2019-04-15 at 11:04 -0400, Chuck Lever wrote:
> Just happened again. Any thoughts about where I should start looking?
>
> Mon Apr 15 11:01:40 EDT 2019
> 4k100test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
> (T) 4096B-4096B, ioengine=libaio, iodepth=1024
> ...
> fio-3.1
> Starting 12 processes
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> 4k100test: Laying out IO file (1 file / 1024MiB)
> fio: native_fallocate call failed: Operation not supported
> fio: io_u error on file 4k100test.7.0: Invalid slot: read
> offset=938229760, buflen=4096

Does the following patch fix the race?
8<--------------------------------------
From 4c8759eafad9bb7ea2626a53296e30618aeefcc7 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
Date: Mon, 15 Apr 2019 11:54:13 -0400
Subject: [PATCH] SUNRPC: Ignore queue transmission errors on successful transmission

If a request transmission fails due to write space or slot unavailability
errors, but the queued task then gets transmitted before it has time to
process the error in call_transmit_status() or call_bc_transmit_status(),
we need to suppress the transmission error code to prevent it from leaking
out of the RPC layer.

Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
---
 net/sunrpc/clnt.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index fa900bb44cd5..369a2648dafc 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -2101,8 +2101,8 @@ call_transmit_status(struct rpc_task *task)
 	 * test first.
 	 */
 	if (rpc_task_transmitted(task)) {
-		if (task->tk_status == 0)
-			xprt_request_wait_receive(task);
+		task->tk_status = 0;
+		xprt_request_wait_receive(task);
 		return;
 	}
 
@@ -2187,6 +2187,9 @@ call_bc_transmit_status(struct rpc_task *task)
 {
 	struct rpc_rqst *req = task->tk_rqstp;
 
+	if (rpc_task_transmitted(task))
+		task->tk_status = 0;
+
 	dprint_status(task);
 
 	switch (task->tk_status) {
-- 
2.20.1

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
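
For anyone following the thread without the kernel tree at hand, here is a rough userspace sketch of the status handling that the first hunk changes. It is not kernel code: `struct sketch_task`, its fields, and both helper functions are hypothetical stand-ins for `struct rpc_task`, `rpc_task_transmitted()` and `xprt_request_wait_receive()`. The use of `EBADSLT` is an assumption based only on the fact that its error string, "Invalid slot", matches the fio message above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef EBADSLT
#define EBADSLT 57 /* Linux value; fallback only, for non-Linux builds */
#endif

/* Hypothetical stand-in for struct rpc_task; not the kernel definition. */
struct sketch_task {
	int  tk_status;     /* result left behind by an earlier transmit attempt */
	bool transmitted;   /* stands in for rpc_task_transmitted(task) */
	bool waiting_reply; /* set where the kernel calls xprt_request_wait_receive() */
};

/* Old behaviour: a stale error survives even though the request went out. */
static int status_before_patch(struct sketch_task *t)
{
	if (t->transmitted) {
		if (t->tk_status == 0)
			t->waiting_reply = true;
	}
	return t->tk_status;
}

/* Patched behaviour: a transmitted request always clears the stale error. */
static int status_after_patch(struct sketch_task *t)
{
	if (t->transmitted) {
		t->tk_status = 0;
		t->waiting_reply = true;
	}
	return t->tk_status;
}

/* Runs both variants on the racy case: error recorded, then transmitted. */
static void sketch_selfcheck(void)
{
	struct sketch_task old_case = { -EBADSLT, true, false };
	struct sketch_task new_case = { -EBADSLT, true, false };

	assert(status_before_patch(&old_case) == -EBADSLT); /* error leaks */
	assert(status_after_patch(&new_case) == 0);         /* error suppressed */
}
```

In the racy case the old logic neither waits for the reply nor clears the error, so the stale status is what the caller sees. The patched logic treats a completed transmission as authoritative.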