Re: [PATCH v3] xfs: avoid deadlock when trigger memory reclaim in ->writepages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jun 16, 2020 at 7:06 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jun 15, 2020 at 04:53:31PM +0200, Michal Hocko wrote:
> > On Mon 15-06-20 16:25:52, Holger Hoffstätte wrote:
> > > On 2020-06-15 13:56, Yafang Shao wrote:
> > [...]
> > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > > index b356118..1ccfbf2 100644
> > > > --- a/fs/xfs/xfs_aops.c
> > > > +++ b/fs/xfs/xfs_aops.c
> > > > @@ -573,9 +573,21 @@ static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
> > > >           struct writeback_control *wbc)
> > > >   {
> > > >           struct xfs_writepage_ctx wpc = { };
> > > > + unsigned int nofs_flag;
> > > > + int ret;
> > > >           xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
> > > > - return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> > > > +
> > > > + /*
> > > > +  * We can allocate memory here while doing writeback on behalf of
> > > > +  * memory reclaim.  To avoid memory allocation deadlocks set the
> > > > +  * task-wide nofs context for the following operations.
> > > > +  */
> > > > + nofs_flag = memalloc_nofs_save();
> > > > + ret = iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
> > > > + memalloc_nofs_restore(nofs_flag);
> > > > +
> > > > + return ret;
> > > >   }
> > > >   STATIC int
> > > >
> > >
> > > Not sure if I did something wrong, but while the previous version of this patch
> > > worked fine, this one gave me (with v2 removed obviously):
> > >
> > > [  +0.000004] WARNING: CPU: 0 PID: 2811 at fs/iomap/buffered-io.c:1544 iomap_do_writepage+0x6b4/0x780
> >
> > This corresponds to
> >         /*
> >          * Given that we do not allow direct reclaim to call us, we should
> >          * never be called in a recursive filesystem reclaim context.
> >          */
> >         if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> >                 goto redirty;
> >
> > which effectivelly says that memalloc_nofs_save/restore cannot be used
> > for that code path.
>
> No it doesn't. Everyone is ignoring the -context- of this code, most
> especially the previous 3 lines of code and it's comment:
>
>         /*
>          * Refuse to write the page out if we are called from reclaim context.
>          *
>          * This avoids stack overflows when called from deeply used stacks in
>          * random callers for direct reclaim or memcg reclaim.  We explicitly
>          * allow reclaim from kswapd as the stack usage there is relatively low.
>          *
>          * This should never happen except in the case of a VM regression so
>          * warn about it.
>          */
>         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
>                         PF_MEMALLOC))
>                 goto redirty;
>
> You will see that we specifically avoid this path from reclaim
> context except for kswapd. And kswapd always runs with GFP_KERNEL
> context so we allow writeback from it, so it will pass both this
> check and the NOFS check above.
>
> IOws, we can't trigger to the WARN_ON_ONCE(current->flags &
> PF_MEMALLOC_NOFS)) check from a memory reclaim context: this
> PF_MEMALLOC_NOFS check here is not doing what people think it is.
>
> History lesson time, eh?
>
> The recursion protection here -used- to use PF_FSTRANS, not
> PF_MEMALLOC_NOFS. See commit 9070733b4efac ("xfs: abstract
> PF_FSTRANS to PF_MEMALLOC_NOFS"). This hunk is most instructive
> when you look at the comment:
>
>          * Given that we do not allow direct reclaim to call us, we should
>          * never be called while in a filesystem transaction.
>          */

This comment is more clear.

> -       if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +       if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>                 goto redirty;
>
> It wasn't for memory allocation recursion protection in XFS - it was
> for transaction reservation recursion protection by something trying
> to flush data pages while holding a transaction reservation. Doing
> this could deadlock the journal because the existing reservation
> could prevent the nested reservation for being able to reserve space
> in the journal and that is a self-deadlock vector.
>
> IOWs, this check is not protecting against memory reclaim recursion
> bugs at all (that's the previous check I quoted). This check is
> protecting against the filesystem calling writepages directly from a
> context where it can self-deadlock.
>
> So what we are seeing here is that the PF_FSTRANS ->
> PF_MEMALLOC_NOFS abstraction lost all the actual useful information
> about what type of error this check was protecting against.
>

Agreed. The check of PF_MEMALLOC_NOFS here really confused me before.
I think it will confuse others as well.

> > Your stack trace doesn't point to a reclaim path
> > which shows that this path is shared and also underlines that this is
> > not really an intended use of the api.
>
> Actually, Michal, it was your PF_FSTRANS -> PF_MEMALLOC_NOFS
> abstraction of this code that turned this from "exactly what
> PF_FSTRANS was intended to catch" to what you call "unintended use
> of the API"....
>
> IOWs, putting the iomap_writepage path under NOFS context is the
> right thing to do from a "prevent memory reclaim" perspective, but
> now we are hitting against the problems of repurposing filesystem
> specific flags for subtlely different generic semantics...
>
> I suspect we need to re-introduce PF_FSTRANS, set/clear/transfer it
> again in all the places XFS used to transfer it, and change this
> iomap check to use PF_FSTRANS and not PF_MEMALLOC_NOFS, because it's
> clearly not and never has been a memory reclaim recursion warning
> check....
>

Agree with you that it's better to convert it back to PF_FSTRANS,
otherwise it will mislead more people.
I would like to send a patch to change it back if there is no
objection to this proposal.



--
Thanks
Yafang




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux