> On Apr 15, 2019, at 12:05 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> Hi Chuck,
>
>
> On Mon, 2019-04-15 at 11:04 -0400, Chuck Lever wrote:
>> Just happened again. Any thoughts about where I should start looking?
>>
>> Mon Apr 15 11:01:40 EDT 2019
>> 4k100test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B,
>> (T) 4096B-4096B, ioengine=libaio, iodepth=1024
>> ...
>> fio-3.1
>> Starting 12 processes
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> 4k100test: Laying out IO file (1 file / 1024MiB)
>> fio: native_fallocate call failed: Operation not supported
>> fio: io_u error on file 4k100test.7.0: Invalid slot: read
>> offset=938229760, buflen=4096
>
> Does the following patch fix the race?
>
> 8<--------------------------------------
> From 4c8759eafad9bb7ea2626a53296e30618aeefcc7 Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> Date: Mon, 15 Apr 2019 11:54:13 -0400
> Subject: [PATCH] SUNRPC: Ignore queue transmission errors on successful
>  transmission
>
> If a request transmission fails due to write space or slot unavailability
> errors, but the queued task then gets transmitted before it has time to
> process the error in call_transmit_status() or call_bc_transmit_status(),
> we need to suppress the transmission error code to prevent it from leaking
> out of the RPC layer.
>
> Reported-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> ---
>  net/sunrpc/clnt.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index fa900bb44cd5..369a2648dafc 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -2101,8 +2101,8 @@ call_transmit_status(struct rpc_task *task)
>  	 * test first.
>  	 */
>  	if (rpc_task_transmitted(task)) {
> -		if (task->tk_status == 0)
> -			xprt_request_wait_receive(task);
> +		task->tk_status = 0;
> +		xprt_request_wait_receive(task);
>  		return;
>  	}
>
> @@ -2187,6 +2187,9 @@ call_bc_transmit_status(struct rpc_task *task)
>  {
>  	struct rpc_rqst *req = task->tk_rqstp;
>
> +	if (rpc_task_transmitted(task))
> +		task->tk_status = 0;
> +
>  	dprint_status(task);
>
>  	switch (task->tk_status) {

I was about to try something like this. I don't have a 100% reproducer.
I will apply your patch and wait for the problem to appear over the next
few days.

--
Chuck Lever
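
For readers following the thread, the race described in the patch's commit
message can be modelled in userspace roughly as below. This is only a toy
sketch, not kernel code: struct toy_task, its fields other than tk_status,
and both helper functions are hypothetical stand-ins invented for
illustration, and -EAGAIN is an arbitrary placeholder for whatever
transmission error was queued.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct rpc_task; not the kernel type. */
struct toy_task {
    int  tk_status;          /* last error recorded against the task */
    bool transmitted;        /* models rpc_task_transmitted() returning true */
    bool waiting_for_reply;  /* models xprt_request_wait_receive() being called */
};

/* Models the pre-patch check: a stale error is left in place even though
 * the request has already been sent, so it can leak to the caller. */
static int status_before_patch(struct toy_task *task)
{
    if (task->transmitted) {
        if (task->tk_status == 0)
            task->waiting_for_reply = true;
        return task->tk_status;
    }
    return task->tk_status;
}

/* Models the patched check: once the request is known to have been
 * transmitted, the queued error is cleared and the task waits for a reply. */
static int status_after_patch(struct toy_task *task)
{
    if (task->transmitted) {
        task->tk_status = 0;
        task->waiting_for_reply = true;
    }
    return task->tk_status;
}
```

In this model, a task with tk_status = -EAGAIN and transmitted = true still
reports -EAGAIN through status_before_patch(), while status_after_patch()
reports 0 and marks the task as waiting for a reply. A task that has not
been transmitted keeps its error in both cases.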