Re: [PATCH v5 27/32] netfs: Change the read result collector to only use one work item

On Fri, Jan 24, 2025 at 2:00 PM Ihor Solodrai <ihor.solodrai@xxxxx> wrote:
>
> On Monday, December 16th, 2024 at 12:41 PM, David Howells <dhowells@xxxxxxxxxx> wrote:
>
> > Change the way netfslib collects read results to do all the collection for
> > a particular read request using a single work item that walks along the
> > subrequest queue as subrequests make progress or complete, unlocking folios
> > progressively rather than doing the unlock in parallel as parallel requests
> > come in.
> >
> > The code is remodelled to be more like the write-side code, though only
> > using a single stream. This makes it more directly comparable and thus
> > easier to duplicate fixes between the two sides.
> >
> > This has a number of advantages:
> >
> > (1) It's simpler. There doesn't need to be a complex donation mechanism
> > to handle mismatches between the size and alignment of subrequests and
> > folios. The collector unlocks folios as the subrequests covering each
> > complete.
> >
> > (2) It should cause less scheduler overhead as there's a single work item
> > in play unlocking pages in parallel when a read gets split up into a
> > lot of subrequests instead of one per subrequest.
> >
> > Whilst the parallelism is nice in theory, in practice, the vast
> > majority of loads are sequential reads of the whole file, so
> > committing a bunch of threads to unlocking folios out of order doesn't
> > help in those cases.
> >
> > (3) It should make it easier to implement content decryption. A folio
> > cannot be decrypted until all the requests that contribute to it have
> > completed - and, again, most loads are sequential and so, most of the
> > time, we want to begin decryption sequentially (though it's great if
> > the decryption can happen in parallel).
> >
> > There is a disadvantage in that we're losing the ability to decrypt and
> > unlock things on an as-things-arrive basis which may affect some
> > applications.
> >
> > Signed-off-by: David Howells <dhowells@xxxxxxxxxx>
> > cc: Jeff Layton <jlayton@xxxxxxxxxx>
> > cc: netfs@xxxxxxxxxxxxxxx
> > cc: linux-fsdevel@xxxxxxxxxxxxxxx
> > ---
> > fs/9p/vfs_addr.c | 3 +-
> > fs/afs/dir.c | 8 +-
> > fs/ceph/addr.c | 9 +-
> > fs/netfs/buffered_read.c | 160 ++++----
> > fs/netfs/direct_read.c | 60 +--
> > fs/netfs/internal.h | 21 +-
> > fs/netfs/main.c | 2 +-
> > fs/netfs/objects.c | 34 +-
> > fs/netfs/read_collect.c | 716 ++++++++++++++++++++---------------
> > fs/netfs/read_pgpriv2.c | 203 ++++------
> > fs/netfs/read_retry.c | 207 +++++-----
> > fs/netfs/read_single.c | 37 +-
> > fs/netfs/write_collect.c | 4 +-
> > fs/netfs/write_issue.c | 2 +-
> > fs/netfs/write_retry.c | 14 +-
> > fs/smb/client/cifssmb.c | 2 +
> > fs/smb/client/smb2pdu.c | 5 +-
> > include/linux/netfs.h | 16 +-
> > include/trace/events/netfs.h | 79 +---
> > 19 files changed, 819 insertions(+), 763 deletions(-)
>
> Hello David.
>
> After a recent merge from upstream, BPF CI started consistently failing
> with a task hanging in v9fs_evict_inode. I bisected the failure to
> commit e2d46f2ec332, which points to this patch.
>
> Reverting the patch seems to have helped:
> https://github.com/kernel-patches/vmtest/actions/runs/12952856569
>
> Could you please investigate?
>
> Examples of failed jobs:
>   * https://github.com/kernel-patches/bpf/actions/runs/12941732247
>   * https://github.com/kernel-patches/bpf/actions/runs/12933849075
>
> A log snippet:
>
>     2025-01-24T02:15:03.9009694Z [  246.932163] INFO: task ip:1055 blocked for more than 122 seconds.
>     2025-01-24T02:15:03.9013633Z [  246.932709]       Tainted: G           OE      6.13.0-g2bcb9cf535b8-dirty #149
>     2025-01-24T02:15:03.9018791Z [  246.933249] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>     2025-01-24T02:15:03.9025896Z [  246.933802] task:ip              state:D stack:0     pid:1055  tgid:1055  ppid:1054   flags:0x00004002
>     2025-01-24T02:15:03.9028228Z [  246.934564] Call Trace:
>     2025-01-24T02:15:03.9029758Z [  246.934764]  <TASK>
>     2025-01-24T02:15:03.9032572Z [  246.934937]  __schedule+0xa91/0xe80
>     2025-01-24T02:15:03.9035126Z [  246.935224]  schedule+0x41/0xb0
>     2025-01-24T02:15:03.9037992Z [  246.935459]  v9fs_evict_inode+0xfe/0x170
>     2025-01-24T02:15:03.9041469Z [  246.935748]  ? __pfx_var_wake_function+0x10/0x10
>     2025-01-24T02:15:03.9043837Z [  246.936101]  evict+0x1ef/0x360
>     2025-01-24T02:15:03.9046624Z [  246.936340]  __dentry_kill+0xb0/0x220
>     2025-01-24T02:15:03.9048855Z [  246.936610]  ? dput+0x3a/0x1d0
>     2025-01-24T02:15:03.9051128Z [  246.936838]  dput+0x114/0x1d0
>     2025-01-24T02:15:03.9053548Z [  246.937069]  __fput+0x136/0x2b0
>     2025-01-24T02:15:03.9056154Z [  246.937305]  task_work_run+0x89/0xc0
>     2025-01-24T02:15:03.9058593Z [  246.937571]  do_exit+0x2c6/0x9c0
>     2025-01-24T02:15:03.9061349Z [  246.937816]  do_group_exit+0xa4/0xb0
>     2025-01-24T02:15:03.9064401Z [  246.938090]  __x64_sys_exit_group+0x17/0x20
>     2025-01-24T02:15:03.9067235Z [  246.938390]  x64_sys_call+0x21a0/0x21a0
>     2025-01-24T02:15:03.9069924Z [  246.938672]  do_syscall_64+0x79/0x120
>     2025-01-24T02:15:03.9072746Z [  246.938941]  ? clear_bhb_loop+0x25/0x80
>     2025-01-24T02:15:03.9075581Z [  246.939230]  ? clear_bhb_loop+0x25/0x80
>     2025-01-24T02:15:03.9079275Z [  246.939510]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>     2025-01-24T02:15:03.9081976Z [  246.939875] RIP: 0033:0x7fb86f66f21d
>     2025-01-24T02:15:03.9087533Z [  246.940153] RSP: 002b:00007ffdb3cf93f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
>     2025-01-24T02:15:03.9092590Z [  246.940689] RAX: ffffffffffffffda RBX: 00007fb86f785fa8 RCX: 00007fb86f66f21d
>     2025-01-24T02:15:03.9097722Z [  246.941201] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000000
>     2025-01-24T02:15:03.9102762Z [  246.941705] RBP: 00007ffdb3cf9450 R08: 00007ffdb3cf93a0 R09: 0000000000000000
>     2025-01-24T02:15:03.9107940Z [  246.942215] R10: 00007ffdb3cf92ff R11: 0000000000000202 R12: 0000000000000001
>     2025-01-24T02:15:03.9113002Z [  246.942723] R13: 0000000000000000 R14: 0000000000000000 R15: 00007fb86f785fc0
>     2025-01-24T02:15:03.9114614Z [  246.943244]  </TASK>

That looks very similar to something I saw in afs testing, with a
similar stack but in afs_evict_inode, where it hung waiting in
netfs_wait_for_outstanding_io.

David pointed to this bit where there's a double get in
netfs_retry_read_subrequests, since netfs_reissue_read already takes
care of getting a ref on the subrequest:

diff --git a/fs/netfs/read_retry.c b/fs/netfs/read_retry.c
index 2290af0d51ac..53d62e31a4cc 100644
--- a/fs/netfs/read_retry.c
+++ b/fs/netfs/read_retry.c
@@ -152,7 +152,6 @@ static void netfs_retry_read_subrequests(struct netfs_io_request *rreq)
 				__clear_bit(NETFS_SREQ_BOUNDARY, &subreq->flags);
 			}
 
-			netfs_get_subrequest(subreq, netfs_sreq_trace_get_resubmit);
 			netfs_reissue_read(rreq, subreq);
 			if (subreq == to)
 				break;

That seems to help for my afs test case; I suspect it might help in
your case as well.

Marc




