On Thu, 07 Apr 2022, Trond Myklebust wrote:
> On Thu, 2022-04-07 at 14:11 +1000, NeilBrown wrote:
> > On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > > On Thu, 2022-04-07 at 11:19 +1000, NeilBrown wrote:
> > > > On Thu, 07 Apr 2022, NeilBrown wrote:
> > > > > On Thu, 07 Apr 2022, Dave Chinner wrote:
> > > > > > On Wed, Apr 06, 2022 at 03:54:24PM -0400, J. Bruce Fields wrote:
> > > > > > > In the last couple days I've started getting hangs on xfstests
> > > > > > > generic/186 on upstream. I also notice the test completes after
> > > > > > > 10+ hours (usually it takes about 5 minutes). Sometimes this is
> > > > > > > accompanied by "nfs: RPC call returned error 12" on the client.
> > > > > >
> > > > > > #define ENOMEM 12 /* Out of memory */
> > > > > >
> > > > > > So either the client or the server is running out of memory
> > > > > > somewhere?
> > > > >
> > > > > Probably the client. There are a bunch of changes recently which
> > > > > add __GFP_NORETRY to memory allocations from PF_WQ_WORKERs because
> > > > > that can result in deadlocks when swapping over NFS.
> > > > > This means that kmalloc requests that previously never failed
> > > > > (because GFP_KERNEL never fails for kernel threads I think) can
> > > > > now fail. This has tickled one bug that I know of. There are
> > > > > likely to be more.
> > > > >
> > > > > The RPC code should simply retry these allocations after a short
> > > > > delay. HZ/4 is the number that is used in a couple of places.
> > > > > Possibly there are more places that need to handle -ENOMEM with
> > > > > rpc_delay().
> > > >
> > > > I had a look through the various places where alloc can now fail.
> > > >
> > > > I think xdr_alloc_bvec() in xprt_send_pagedata() is the most likely
> > > > cause of a problem here. I don't think an -ENOMEM from there is
> > > > caught, so it could likely filter up to NFS and result in the
> > > > message you got.
> > > >
> > > > I don't think we can easily handle failure there. We need to stay
> > > > with GFP_KERNEL and rely on PF_MEMALLOC to make forward progress
> > > > for swap-over-NFS.
> > > >
> > > > Bruce: can you change that one line back to GFP_KERNEL and see if
> > > > the problem goes away?
> > > >
> > >
> > > Can we please just move the call to xdr_alloc_bvec() out of
> > > xprt_send_pagedata()? Move the client side allocation into
> > > xs_stream_prepare_request() and xs_udp_send_request(), then move the
> > > server side allocation into svc_udp_sendto().
> > >
> > > That makes it possible to handle errors.
> >
> > Like the below I guess. Seems sensible, but I don't know the code
> > well enough to really review it.
> >
> > > > The other problem I found is that rpc_alloc_task() can now fail,
> > > > but rpc_new_task assumes that it never will. If it does, then we
> > > > get a NULL deref.
> > > >
> > > > I don't think rpc_new_task() can ever be called from the rpciod
> > > > work queue, so it is safe to just use a mempool with GFP_KERNEL
> > > > like we did before.
> > >
> > > No. We shouldn't ever use mempools with GFP_KERNEL.
> >
> > Why not? mempools with GFP_KERNEL make perfect sense outside of the
> > rpciod and nfsiod threads.
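
To make concrete what I meant here, the sort of thing I have in mind is
below. This is only an untested sketch: the PF_WQ_WORKER test is a
stand-in for "are we on rpciod/nfsiod", and the flags chosen for that
case are illustrative only - the point is the GFP_KERNEL wait on the
mempool in the common case.

#include <linux/gfp.h>
#include <linux/mempool.h>
#include <linux/sched.h>
#include <linux/sunrpc/sched.h>

/* rpc_task_mempool stands for the existing 8-element pool in
 * net/sunrpc/sched.c */
static mempool_t *rpc_task_mempool;

static struct rpc_task *rpc_alloc_task(void)
{
	/*
	 * On a PF_WQ_WORKER (i.e. rpciod/nfsiod) we must not sleep
	 * waiting for memory, so make a non-blocking attempt that may
	 * fail.
	 */
	gfp_t gfp = (current->flags & PF_WQ_WORKER) ?
			GFP_NOWAIT | __GFP_NOWARN : GFP_KERNEL;

	/*
	 * Everywhere else, GFP_KERNEL may block, but at most until one
	 * of the (at most 8) tasks already taken from the pool is
	 * freed - not until general memory pressure eases, as with
	 * __GFP_NOFAIL.
	 */
	return mempool_alloc(rpc_task_mempool, gfp);
}
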
>
> If you can afford to make it an infinite wait, there is __GFP_NOFAIL,
> so why waste the resources of an emergency pool? In my opinion,
> however, an infinite uninterruptible sleep is bad policy for almost
> all cases because someone will want to break out at some point.

"infinite" isn't a useful description. The important question is "what
will allow the allocation to complete?". For __GFP_NOFAIL there is no
clear answer beyond "reduction of memory pressure", and sometimes that
is enough. For a mempool we have a much more specific answer: memory
becomes available as previous requests complete. rpc_task_mempool has a
size of 8, so there can always be 8 requests in flight. Waiting on the
mempool will wait at most until there are seven requests in flight, and
then will return a task. This is a much better guarantee than for
__GFP_NOFAIL.

If you ever need an rpc task to relieve memory pressure, then
__GFP_NOFAIL could deadlock, while the mempool cannot. If you are never
going to block waiting on a mempool and would rather fail instead, then
there seems little point in having the mempool.

> >
> > > Most, if not all of the callers of rpc_run_task() are still capable
> > > of dealing with errors, and ditto for the callers of
> > > rpc_run_bc_task().
> >
> > Yes, they can deal with errors. But in many cases they do so by
> > passing the error up the call stack, so we could start getting ENOMEM
> > for system calls like stat(). I don't think that is a good idea.
>
> stat() has always been capable of returning ENOMEM if, for instance,
> inode allocation fails. There are almost no calls in NFS (or most
> other filesystems for that matter) that can't fail somehow when memory
> starts to get really scarce.

Fair enough. It is writes that are really important.

>
> The bottom line is that we use ordinary GFP_KERNEL memory allocations
> where we can. The new code follows that rule, breaking it only in
> cases where the specific rules of rpciod/xprtiod/nfsiod make it
> impossible to wait forever in the memory manager.

It is not safe to use GFP_KERNEL for an allocation that is needed in
order to free memory - and so for any allocation that is needed to
write out data from the page cache.

>
> I am preparing a set of patches to address the issues that you've
> identified, plus a case in call_transmit_status/call_bc_transmit_status
> where we're not handling ENOMEM. There are also patches to fix up two
> cases in the NFS code itself where we're not currently handling errors
> from rpc_run_task.

I look forward to reviewing them.

Thanks,
NeilBrown
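
P.S. For the call_transmit_status/call_bc_transmit_status ENOMEM cases,
I assume the fix will be the same rpc_delay() back-off (HZ/4 or so)
mentioned earlier in the thread. Something like the sketch below is all
I mean - it is untested, only shows the one case, and your patches may
well pick a different restart point:

/* This would live in net/sunrpc/clnt.c; sketch only - not the real
 * function, just the -ENOMEM handling pattern.
 */
#include <linux/jiffies.h>
#include <linux/sunrpc/sched.h>

static void call_transmit(struct rpc_task *task);	/* existing state */

static void call_transmit_status(struct rpc_task *task)
{
	switch (task->tk_status) {
	case -ENOMEM:
		/*
		 * An allocation such as xdr_alloc_bvec() failed. Wait
		 * about a quarter of a second for memory to be freed,
		 * then retry the transmit rather than letting ENOMEM
		 * escape to the application.
		 */
		task->tk_status = 0;
		rpc_delay(task, HZ >> 2);
		task->tk_action = call_transmit;
		break;
	default:
		/* all the existing cases stay as they are */
		break;
	}
}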