On Wed, 2021-03-03 at 08:10 -0500, Benjamin Coddington wrote:
> Hi Trond,
>
> I'd like to go back to setting sk->sk_allocation = GFP_NOIO (see
> a1231fda7e94). That would cover sync tasks as well as async, but I'm
> not sure what memalloc_nofs_save/restore gives us and if we should
> just try to apply that to all tasks.
>
> We're getting some folks deadlocked on the xprt_sending queue with
> stacks like:
>
>  #0 [ffffacab45f17108] __schedule at ffffffffae4d1826
>  #1 [ffffacab45f171a0] schedule at ffffffffae4d1cb8
>  #2 [ffffacab45f171b0] rpc_wait_bit_killable at ffffffffc067d44e [sunrpc]
>  #3 [ffffacab45f171c8] __wait_on_bit at ffffffffae4d216c
>  #4 [ffffacab45f17200] out_of_line_wait_on_bit at ffffffffae4d2211
>  #5 [ffffacab45f17250] __rpc_execute at ffffffffc067f3fc [sunrpc]
>  #6 [ffffacab45f172a8] rpc_run_task at ffffffffc06732c4 [sunrpc]
>  #7 [ffffacab45f172e8] nfs4_proc_layoutreturn at ffffffffc08f5d44 [nfsv4]
>  #8 [ffffacab45f17388] pnfs_send_layoutreturn at ffffffffc091946e [nfsv4]
>  #9 [ffffacab45f173d8] _pnfs_return_layout at ffffffffc091ba8b [nfsv4]
> #10 [ffffacab45f17450] nfs4_evict_inode at ffffffffc0906a05 [nfsv4]
> #11 [ffffacab45f17460] evict at ffffffffadef8592
> #12 [ffffacab45f17480] dispose_list at ffffffffadef86a8
> #13 [ffffacab45f174a0] prune_icache_sb at ffffffffadef99a2
> #14 [ffffacab45f174c8] super_cache_scan at ffffffffadede183
> #15 [ffffacab45f17518] do_shrink_slab at ffffffffade3d5d8
> #16 [ffffacab45f17588] shrink_slab at ffffffffade3dab5
> #17 [ffffacab45f17608] shrink_node at ffffffffade42a8c
> #18 [ffffacab45f17678] do_try_to_free_pages at ffffffffade42e43
> #19 [ffffacab45f176c8] try_to_free_pages at ffffffffade431c8
> #20 [ffffacab45f17768] __alloc_pages_slowpath at ffffffffade81981
> #21 [ffffacab45f17868] __alloc_pages_nodemask at ffffffffade82555
> #22 [ffffacab45f178c8] skb_page_frag_refill at ffffffffae31bea7
> #23 [ffffacab45f178e0] sk_page_frag_refill at ffffffffae31c71d
> #24 [ffffacab45f178f8] tcp_sendmsg_locked at ffffffffae3cbe65
> #25 [ffffacab45f179a0] tcp_sendmsg at ffffffffae3cc8f7
> #26 [ffffacab45f179c0] sock_sendmsg at ffffffffae317cce
> #27 [ffffacab45f179d8] xs_sendpages at ffffffffc0679741 [sunrpc]
> #28 [ffffacab45f17ac8] xs_tcp_send_request at ffffffffc067adb4 [sunrpc]
> #29 [ffffacab45f17b20] xprt_transmit at ffffffffc067674c [sunrpc]
> #30 [ffffacab45f17b90] call_transmit at ffffffffc0672064 [sunrpc]
> #31 [ffffacab45f17ba0] __rpc_execute at ffffffffc067f365 [sunrpc]
> #32 [ffffacab45f17bf8] rpc_run_task at ffffffffc06732c4 [sunrpc]
> #33 [ffffacab45f17c38] nfs4_call_sync_custom at ffffffffc08e50bb [nfsv4]
> #34 [ffffacab45f17c48] nfs4_call_sync_sequence at ffffffffc08e5143 [nfsv4]
> #35 [ffffacab45f17cb8] _nfs4_proc_getattr at ffffffffc08e7f08 [nfsv4]
> #36 [ffffacab45f17d78] nfs4_proc_getattr at ffffffffc08f200a [nfsv4]
> #37 [ffffacab45f17de8] __nfs_revalidate_inode at ffffffffc08741d7 [nfs]
> #38 [ffffacab45f17e18] nfs_getattr at ffffffffc0874458 [nfs]
> #39 [ffffacab45f17e60] vfs_statx_fd at ffffffffadedf8a4
> #40 [ffffacab45f17e98] __do_sys_newfstat at ffffffffadedfedd
> #41 [ffffacab45f17f38] do_syscall_64 at ffffffffadc0419b
> #42 [ffffacab45f17f50] entry_SYSCALL_64_after_hwframe at ffffffffae6000ad
>     RIP: 00007f721ddcdd37  RSP: 00007ffc0d54cab8  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: 00007f721e09b3c0  RCX: 00007f721ddcdd37
>     RDX: 00007ffc0d54cac0  RSI: 00007ffc0d54cac0  RDI: 0000000000000001
>     RBP: 00007f721e09f6c0  R8:  00007f721f87cf00  R9:  0000000000000000
>     R10: 00007ffc0d54a32a  R11: 0000000000000246  R12: 00007f721e09b3c0
>     R13: 000055606b81376e  R14: 0000000000000013  R15: 00007f721e09b3c0
>     ORIG_RAX: 0000000000000005  CS: 0033  SS: 002b
>
> Ben
>

Please just wrap that call to __rpc_execute() in rpc_execute() with a
memalloc_nofs_save()/memalloc_nofs_restore() pair. That should cause the
mm layer to do the correct thing here, and prevent re-entry into the NFS
code.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
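
A minimal sketch of the wrapping described above, assuming rpc_execute()
in net/sunrpc/sched.c keeps its usual shape (set the task active, queue
it, then run synchronous tasks inline in the caller's context); only the
memalloc_nofs_save()/memalloc_nofs_restore() lines around the synchronous
__rpc_execute() call are new, and the async path is left alone since it
runs on rpciod:

	#include <linux/sched/mm.h>	/* memalloc_nofs_save/restore */

	void rpc_execute(struct rpc_task *task)
	{
		bool is_async = RPC_IS_ASYNC(task);

		rpc_set_active(task);
		rpc_make_runnable(rpciod_workqueue, task);
		if (!is_async) {
			/*
			 * Forbid __GFP_FS reclaim (and hence re-entry into
			 * NFS inode eviction) for anything allocated while
			 * this synchronous RPC runs in the caller's context.
			 */
			unsigned int pflags = memalloc_nofs_save();

			__rpc_execute(task);
			memalloc_nofs_restore(pflags);
		}
	}

With that scope in place, the sk_page_frag_refill() allocation in the
getattr stack above would see GFP_NOFS-equivalent behaviour and could not
recurse into the NFS shrinker while the transport lock is held.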