Am testing this fix now (especially with multichannel scenarios which were failing before) On Fri, Feb 28, 2025 at 11:22 AM David Howells <dhowells@xxxxxxxxxx> wrote: > > > A netfs read request can run in one of two modes: for synchronous reads > writes, the app thread does the collection of results and for asynchronous > reads, this is offloaded to a worker thread. This is controlled by the > NETFS_RREQ_OFFLOAD_COLLECTION flag. > > Now, if a subrequest incurs an error, the NETFS_RREQ_PAUSE flag is set to > stop the issuing loop temporarily from issuing more subrequests until a > retry is successful or the request is abandoned. > > When the issuing loop sees NETFS_RREQ_PAUSE, it jumps to > netfs_wait_for_pause() which will wait for the PAUSE flag to be cleared - > and whilst it is waiting, it will call out to the collector as more results > acrue... But this is the wrong thing to do if OFFLOAD_COLLECTION is set as > we can then end up with both the app thread and the work item collecting > results simultaneously. > > This manifests itself occasionally when running the generic/323 xfstest > against multichannel cifs as an oops that's a bit random but frequently > involving io_submit() (the test does lots of simultaneous async DIO reads). > > Fix this by only doing the collection in netfs_wait_for_pause() if the > NETFS_RREQ_OFFLOAD_COLLECTION is not set. > > Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") > Reported-by: Steve French <stfrench@xxxxxxxxxxxxx> > Signed-off-by: David Howells <dhowells@xxxxxxxxxx> > cc: Paulo Alcantara <pc@xxxxxxxxxxxxx> > cc: Jeff Layton <jlayton@xxxxxxxxxx> > cc: linux-cifs@xxxxxxxxxxxxxxx > cc: netfs@xxxxxxxxxxxxxxx > cc: linux-fsdevel@xxxxxxxxxxxxxxx > --- > fs/netfs/read_collect.c | 18 ++++++++++-------- > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c > index 636cc5a98ef5..23c75755ad4e 100644 > --- a/fs/netfs/read_collect.c > +++ b/fs/netfs/read_collect.c > @@ -682,14 +682,16 @@ void netfs_wait_for_pause(struct netfs_io_request *rreq) > trace_netfs_rreq(rreq, netfs_rreq_trace_wait_queue); > prepare_to_wait(&rreq->waitq, &myself, TASK_UNINTERRUPTIBLE); > > - subreq = list_first_entry_or_null(&stream->subrequests, > - struct netfs_io_subrequest, rreq_link); > - if (subreq && > - (!test_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags) || > - test_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags))) { > - __set_current_state(TASK_RUNNING); > - netfs_read_collection(rreq); > - continue; > + if (!test_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &rreq->flags)) { > + subreq = list_first_entry_or_null(&stream->subrequests, > + struct netfs_io_subrequest, rreq_link); > + if (subreq && > + (!test_bit(NETFS_SREQ_IN_PROGRESS, &subreq->flags) || > + test_bit(NETFS_SREQ_MADE_PROGRESS, &subreq->flags))) { > + __set_current_state(TASK_RUNNING); > + netfs_read_collection(rreq); > + continue; > + } > } > > if (!test_bit(NETFS_RREQ_IN_PROGRESS, &rreq->flags) || > > -- Thanks, Steve