Re: [PATCH] loop: add WQ_MEM_RECLAIM flag to per device workqueue

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 23 Mar 2022 09:59:14 +1100

On Tue, Mar 22, 2022 at 12:19:40PM -1000, Tejun Heo wrote:
> On Wed, Mar 23, 2022 at 07:05:56AM +0900, Tetsuo Handa wrote:
> > > Hmmm... yeah, I actually don't know the exact dependency here and the
> > > dependency may not be real - e.g. the conclusion might be that loop is
> > > conflating different uses and needs to split its use of workqueues into two
> > > separate ones. Tetsuo, can you post more details on the warning that you're
> > > seeing?
> > > 
> > 
> > It was reported at https://lore.kernel.org/all/20210322060334.GD32426@xsang-OptiPlex-9020/ .
> 
> Looks like a correct dependency to me. The work item is being flushed from
> good old write path. Dave?

The filesystem buffered write IO path isn't part of memory reclaim -
it's a user IO path and I think most filesystems will treat it that
way.

We've had similar layering problems with the loop IO path implying
GFP_NOFS must be used by filesystems allocating memory in the IO
path - we solved that by requiring the loop IO submission context
(loop_process_work()) to set PF_MEMALLOC_NOIO so that it didn't
deadlock anywhere in the underlying filesystems that have no idea
that the loop device has added memory reclaim constraints to the IO
path.
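
To make the scoping idea concrete, here is a minimal userspace sketch of
how a task-wide "no IO" flag can constrain allocations made by code that
knows nothing about the loop device. It is a model only, not the kernel
implementation: every model_* and MODEL_* name is hypothetical, the flag
values are illustrative, and the real GFP_KERNEL carries more bits than
the two shown here.

```c
/* Hypothetical stand-ins for kernel allocation flags; values are illustrative. */
#define MODEL_GFP_IO            0x1u
#define MODEL_GFP_FS            0x2u
#define MODEL_GFP_KERNEL        (MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical stand-in for the per-task PF_MEMALLOC_NOIO flag. */
#define MODEL_PF_MEMALLOC_NOIO  0x1u

/* Models current->flags for a single task. */
static unsigned int task_flags;

/* Enter a NOIO scope; returns the previous state so scopes can nest. */
static unsigned int model_noio_save(void)
{
	unsigned int old = task_flags & MODEL_PF_MEMALLOC_NOIO;

	task_flags |= MODEL_PF_MEMALLOC_NOIO;
	return old;
}

/* Leave a NOIO scope, restoring whatever state was saved on entry. */
static void model_noio_restore(unsigned int old)
{
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOIO) | old;
}

/* The allocator masks the caller's flags according to the task context. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	return gfp;
}

/* Underlying filesystem write path: always asks for GFP_KERNEL. */
static unsigned int model_fs_write(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* Loop IO submission context: wraps the filesystem call in a NOIO scope. */
static unsigned int model_loop_process_work(void)
{
	unsigned int old = model_noio_save();
	unsigned int gfp = model_fs_write();

	model_noio_restore(old);
	return gfp;
}
```

Called directly, model_fs_write() keeps both the IO and FS bits; called
via model_loop_process_work(), the same unmodified filesystem code sees
both bits cleared. The constraint lives in the submitter, which is the
layer that introduced it.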

This seems like it's the same layering problem - syscall facing IO
paths are designed for incoming IO from user context, not outgoing
writeback IO from memory reclaim contexts. Memory reclaim contexts
are supposed to use back end filesystem operations like
->writepages() to flush dirty data when necessary.

If the loop device IO mechanism means that every ->write_iter path
needs to be considered as directly in the memory reclaim path, then
that means a huge amount of the kernel needs to be considered as "in
memory reclaim". i.e. it's not just this one XFS workqueue that is
going to have this problem - it's any workqueue that can be waited on
by the incoming IO path.

For example, a network filesystem might put the network stack directly
in the IO path. Which means if we then put loop on top of that
filesystem, various workqueues in the network stack may now need to
be considered as running under the memory reclaim path because of
the loop block device.
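
A rough userspace sketch of how that requirement spreads, assuming the
rule is simply "a WQ_MEM_RECLAIM workqueue must not wait on one that
lacks the flag". The workqueue names and the dependency chain are made
up for illustration, and all model_* identifiers are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEPS 8

/* Hypothetical model of a workqueue and the workqueues it may wait on. */
struct model_wq {
	const char *name;
	bool mem_reclaim;              /* models WQ_MEM_RECLAIM */
	int waits_on[MODEL_MAX_DEPS];  /* indices into the array, -1 terminated */
};

/* Count reclaim workqueues that wait on a non-reclaim workqueue. */
static int model_count_violations(const struct model_wq *wqs, size_t n)
{
	int violations = 0;

	for (size_t i = 0; i < n; i++) {
		if (!wqs[i].mem_reclaim)
			continue;
		for (size_t j = 0; j < MODEL_MAX_DEPS && wqs[i].waits_on[j] >= 0; j++)
			if (!wqs[wqs[i].waits_on[j]].mem_reclaim)
				violations++;
	}
	return violations;
}

/* "Fix" each violation by flagging the target; repeat until none remain. */
static int model_whack_a_mole(struct model_wq *wqs, size_t n)
{
	int marked = 0;
	bool changed = true;

	while (changed) {
		changed = false;
		for (size_t i = 0; i < n; i++) {
			if (!wqs[i].mem_reclaim)
				continue;
			for (size_t j = 0; j < MODEL_MAX_DEPS && wqs[i].waits_on[j] >= 0; j++) {
				struct model_wq *target = &wqs[wqs[i].waits_on[j]];

				if (!target->mem_reclaim) {
					target->mem_reclaim = true;
					marked++;
					changed = true;
				}
			}
		}
	}
	return marked;
}

/* Made-up chain: loop -> filesystem -> network stack; one bystander. */
static int model_demo(void)
{
	struct model_wq wqs[] = {
		{ "loop0",      true,  { 1, -1 } },
		{ "fs-sync",    false, { 2, -1 } },
		{ "net-xmit",   false, { 3, -1 } },
		{ "net-helper", false, { -1 } },
		{ "unrelated",  false, { -1 } },
	};
	size_t n = sizeof(wqs) / sizeof(wqs[0]);
	int before = model_count_violations(wqs, n);
	int marked = model_whack_a_mole(wqs, n);
	int after = model_count_violations(wqs, n);

	/* Only one violation is visible up front, yet three queues get flagged. */
	return (before == 1 && after == 0) ? marked : -1;
}
```

In this toy graph a single visible violation ends with three workqueues
flagged, because each "fix" exposes the next dependency down the chain.
Only the workqueue that nothing reclaim-flagged waits on is left alone.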

I don't know what the solution is, but if the fix is "xfs needs to
mark a workqueue that has nothing to do with memory reclaim as
WQ_MEM_RECLAIM because of the loop device" then we're talking about
playing workqueue whack-a-mole across the entire kernel forever
more....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx