Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

Michal Hocko <mhocko@xxxxxxxxxx> · Fri, 27 Nov 2015 16:01:33 +0100

On Fri 27-11-15 16:40:03, Vladimir Davydov wrote:
> On Fri, Nov 27, 2015 at 01:50:05PM +0100, Michal Hocko wrote:
> > On Thu 26-11-15 11:16:24, Vladimir Davydov wrote:
[...]
> > > Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe either,
> > > because it isn't an allocation context problem: the reclaimer locks up
> > > not because it tries to take an fs/io lock the caller holds, but because
> > > it waits for isolated pages to be put back, which will never happen,
> > > since processes that isolated them depend on the kthread making
> > > progress. This is purely a reclaimer heuristic, which kmalloc users are
> > > not aware of.
> > > 
> > > My point is that, in contrast to userspace processes, it is dangerous to
> > > throttle kthreads in the reclaimer, because they might be responsible
> > > for reclaimer progress (e.g. performing writeback).
> > 
> > Wouldn't it be better if your writeback kthread used PF_MEMALLOC/__GFP_MEMALLOC
> > instead? It is in fact a reclaimer, so it wouldn't even get to the reclaim.
> 
> The driver we use is similar to loop. It works as a proxy to fs it works
> on top of. Allowing it to access emergency reserves would deplete them
> quickly, just like in case of plain loop.

OK, I see. I thought it would be using only a very limited amount of
memory for the writeback.

> The problem is not about our driver, in fact. I'm pretty sure one can
> hit it when using memcg along with loop or dm-crypt for instance.

I am not very familiar with either of them, but from a quick look the
loop driver doesn't use mempools either; it simply relays the data to the
underlying file and relies on the underlying fs to write all the pages,
and only prevents recursion by clearing __GFP_FS and __GFP_IO. Then
I am not really sure how we can guarantee forward progress. The
GFP_NOFS allocation might loop inside the allocator endlessly and so
the writeback wouldn't make any progress. This doesn't seem to be
memcg specific. The global case would just replace the deadlock by a
livelock. I certainly must be missing something here.
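
To make the mask handling concrete, here is a small userspace model of
what "clearing __GFP_FS and __GFP_IO" amounts to. It is only a sketch:
the MODEL_* bit values and the helper name are made up for illustration
and are not the kernel's definitions.

```c
/* Illustrative bits for this model only; NOT the kernel's __GFP_* values. */
#define MODEL_GFP_WAIT 0x1u
#define MODEL_GFP_IO   0x2u
#define MODEL_GFP_FS   0x4u

/*
 * Hypothetical helper: derive the mask used for the backing file's
 * allocations by dropping the IO and FS bits, so that reclaim entered
 * from such an allocation cannot recurse into the fs or block layer.
 * Every other bit of the original mask is preserved.
 */
static unsigned int model_backing_gfp_mask(unsigned int old_mask)
{
	return old_mask & ~(MODEL_GFP_IO | MODEL_GFP_FS);
}
```

Note that this only removes the recursion; nothing in the mask makes the
allocation itself succeed, which is the forward progress concern above.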

> > There are way too many allocations done from the kernel thread context to be
> > not throttled (just look at worker threads).
> 
> What about throttling them only once then?

This still sounds way too broad to me and I am not even sure it solves
the problem. If anything, I think we really should make it specific
only to those callers who are really required to make forward
progress. What about PF_LESS_THROTTLE? NFS is already using this flag for a
similar purpose and we indeed do not throttle at a few places during
reclaim. So I would expect a current_may_throttle() check there,
although I must confess I have no idea about the whole condition right
now.
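
Roughly the shape I have in mind, as a userspace model and not as a
patch. All model_* names and flag values below are invented for the
sketch, and the may-throttle helper is reduced to the flag test only;
the real helper may check more than that.

```c
#include <stdbool.h>

/* Illustrative flag bits for this model only; NOT the kernel's PF_* values. */
#define MODEL_PF_KTHREAD       0x1u
#define MODEL_PF_LESS_THROTTLE 0x2u

struct model_task {
	unsigned int flags;
};

/* Simplified stand-in for a may-throttle check: flag test only. */
static bool model_may_throttle(const struct model_task *t)
{
	return !(t->flags & MODEL_PF_LESS_THROTTLE);
}

/* Simplified heuristic: more pages isolated than left on the inactive list. */
static bool model_too_many_isolated(unsigned long isolated,
				    unsigned long inactive)
{
	return isolated > inactive;
}

/*
 * Decide whether the reclaimer should sleep and retry. Only tasks that
 * explicitly opted out are exempt; being a kthread alone is not enough.
 */
static bool model_should_wait(const struct model_task *t,
			      unsigned long isolated, unsigned long inactive)
{
	return model_too_many_isolated(isolated, inactive) &&
	       model_may_throttle(t);
}
```

With this, a plain worker kthread is still throttled, while a task that
set the opt-out flag because writeback depends on it is let through.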

> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 97ba9e1cde09..9253f4531b9c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1578,6 +1578,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  		/* We are about to die and free our memory. Return now. */
>  		if (fatal_signal_pending(current))
>  			return SWAP_CLUSTER_MAX;
> +
> +		if (current->flags & PF_KTHREAD)
> +			break;
>  	}
>  
>  	lru_add_drain();
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .

-- 
Michal Hocko
SUSE Labs