Re: sporadic hangs on generic/186

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, 08 Apr 2022, Dave Chinner wrote:
> On Fri, Apr 08, 2022 at 12:46:08PM +1000, NeilBrown wrote:
> > On Thu, 07 Apr 2022, Trond Myklebust wrote:
> > > The bottom line is that we use ordinary GFP_KERNEL memory allocations
> > > where we can. The new code follows that rule, breaking it only in cases
> > > where the specific rules of rpciod/xprtiod/nfsiod make it impossible to
> > > wait forever in the memory manager.
> > 
> > It is not safe to use GFP_KERNEL for an allocation that is needed in
> > order to free memory - and so any allocation that is needed to write out
> > data from the page cache.
> 
> Except that same page cache writeback path can be called from
> syscall context (e.g.  fsync()) which has nothing to do with memory
> reclaim. In that case GFP_KERNEL is the correct allocation context
> to use because there are no constraints on what memory reclaim can
> be performed from this path.
> 
> IOWs, if the context initiating data writeback doesn't allow
> GFP_KERNEL allocations, then it should be calling
> memalloc_nofs_save() or memalloc_noio_save() to constrain all
> allocations to the required context. We should not be requiring the
> filesystem (or any other subsystem) to magically infer that the IO
> is being done in a constrained allocation context and modify the
> context they use appropriately.
> 
> If we this, then all filesystems would simply use GFP_NOIO
> everywhere because the loop device layers the entire filesystem IO
> path under block device context (i.e. requiring GFP_NOIO allocation
> context). We don't do this - the loop device sets PF_MEMALLOC_NOIO
> instead so all allocations in that path run with at least GFP_NOIO
> constraints and filesystems are none the wiser about the constraints
> of the calling context.
> 
> IOWs, GFP_KERNEL is generally right context to be using in
> filesystem IO paths and callers need to restrict allocation contexts
> via task flags if they cannot allow certain types of reclaim
> recursion to occur...

NOIO and NOFS are not the issue here.  We all agree that
memalloc_noXX_save() is the right thing to do.

The issue is that memalloc can block indefinitely in low-memory
situations, and any code that has to make progress in low-memory
situations - like writeout - needs to be careful.

This is why the block layer uses mempools for request headers etc - so
that progress is guaranteed without depending on alloc_page() to
succeed.

File systems do often get away with using GFP_KERNEL because the
important paths has PF_MEMALLOC and hence __GFP_MEMALLOC in effect and
that provides access to some shared reserves.  Shared reserves are risky
- the other users you are sharing with might steal it all.

File systems tend to survive anyway because there is a limit on the
mount of dirty filesystem data - so there is lots of non-filesystem data
around, and a good chance that some of that can be freed.

I say "tend to" because I believe the is no real guarantee.  It seems to
actually work 99.999% of the time, and maybe that is enough.

I suspect you might be able to deadlock filesystem writeout by
memlocking lots of memory while there are lots of dirty pages.  It
probably wouldn't be easy though.

swap-out is different.  There is no limit the the amount of dirty anon
data, so it is fairly easy to get a situation where you absolutely must
write out some anon pages before alloc_page() can succeed. 
Most filesystems don't handle swap-out directly - they just tell the MM
which disk addresses to use and submit_bio() is used for writing.
The bio is allocated from a mempool, and nothing below submit_bio() uses
GFP_KERNEL to alloc_page() - they all use mempools (or accept failure in
some other way).  A separate mempool at each level - they aren't shared
(so they are quite different from __GFP_MEMALLOC).

NFS is different.  NFS handles swap using the same paths as other
writes, so it is much more likely to face indefinite waits in
alloc_page() - it least when handling swap.  __GFP_MEMALLOC helps to a
degree but there a lots of levels, and the more levels we have have
local reserves (even if the mempool only reserves a single element), the
better.

The networking people refuse to use mempools (or at least, they did once
some years ago) and I cannot entirely blame them as there are lots of
moving parts - lots of different things that might need to be allocated
(arp table entries?) but usually aren't.  So for everything in the
socket layer and below we rely on __GFP_MEMALLOC (and recommend
increasing the reserves a bit above the default, just in case).

But in NFS and particularly in SUNRPC we already have the mempool, and
there is only a small number of things that we need to allocate to
ensure forward progress.  So using a mempool as designed, rather than
relying on MEMALLOC reserves is the sensible thing to do, and leaves
more of the reserves for the networking layer.

In fact, the allocation that SUNRPC now does before trying a mempool
should really be using __GFP_NOMEMALLOC so that they don't take from the
shared reserves (even when PF_MEMALLOC is set).  As it has a private
reserve (the mempool) it should leave the common reserve for other
layers (sockets etc).

NeilBrown



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux