Re: [PATCH] mm, percpu: do not consider sleepable allocations atomic

Dennis Zhou <dennis@xxxxxxxxxx> · Thu, 20 Feb 2025 18:36:14 -0800

On Fri, Feb 14, 2025 at 04:52:42PM +0100, Michal Hocko wrote:
> On Wed 12-02-25 13:39:31, Dennis Zhou wrote:
> > Hello,
> > 
> > On Wed, Feb 12, 2025 at 11:30:08AM -1000, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On Wed, Feb 12, 2025 at 09:53:20PM +0100, Michal Hocko wrote:
> > > ...
> > > > > Hmm... you'd be a better judge on whether that'd be okay or not but it does
> > > > > bother me that we might be increasing the chance of allocation failures for
> > > > > GFP_KERNEL users at least under memory pressure.
> > > > 
> > > > Nope, this will not change the allocation failure mode. Reclaim
> > > > constraints do not change the failure mode; they just change how much the
> > > > allocation might struggle to reclaim to succeed.
> > > >
> > > > My undocumented assumption (another debt on my end) is that pcp
> > > > allocations are not hot paths. So the worst case is that a GFP_KERNEL
> > > > pcp allocation could have been satisfied _easier_ (i.e. faster) because
> > > > it could have reclaimed fs/io caches and now it needs to rely on kswapd
> > > > to do that in memory-tight situations. On the other hand we have a
> > > > situation where NOIO/FS allocations fail prematurely, so there are
> > > > certainly some pros and cons.
> > > 
> > > I'm having a hard time following. Are you saying that it won't increase the
> > > likelihood of allocation failures even under memory pressure but that it
> > > might just make allocations take longer to succeed?
> > > 
> > > NOFS/IO prevents the allocation attempt from entering fs/io reclaim paths,
> > > right? It would still trigger kswapd for reclaim but can the allocation
> > > attempt wait for that to finish? If so, wouldn't that constitute a
> > > dependency cycle all the same?
> > > 
> > > All in all, percpu allocations taking longer under memory pressure is fine.
> > > Becoming more prone to allocation failures, especially for GFP_KERNEL
> > > callers, probably isn't great.
> > > 
> > 
> > Wait, I think I'm interpreting this change differently. This is
> > preventing the worker from allocating backing pages via GFP_KERNEL. It
> > isn't preventing an allocation via alloc_percpu() from being GFP_KERNEL
> > and providing those flags down to the backing page code. alloc_percpu()
> > for GFP_KERNEL allocations will populate the pages before returning.
> 
> Correct.
>  
> > I'm reading this as potentially making atomic percpu allocations fail as
> > we might be low on backing pages. This change makes the worker now need
> > to wait for kswapd to give it pages. Consequently, if there are a lot of
> > allocations coming in when it's low, we might burn a bit of cpu from the
> > worker now.
> 
> Yes, this is a potential side effect. On the other hand NOFS/NOIO requests
> wouldn't be considered atomic anymore and they wouldn't fail that
> easily. Maybe that is an odd case not worth the additional worker
> overhead. As I've said, I am not familiar enough with the pcp internals
> to know how often the worker is really required.
> 

I've thought about this in the back of my head for the past few weeks. I
think I have 2 questions about this change.

1. Back to what TJ said earlier about probing. I feel like GFP_KERNEL
   allocations should be okay because that more or less is control plane
   time? I'm not sure dropping PR_SET_IO_FLUSHER is all that big of a
   workaround?

2. This change breaks the feedback loop as we discussed above.
   Historically we've targeted 2-4 free pages worth of percpu memory.
   This is done by kicking the percpu work off. That does GFP_KERNEL
   allocations and if that requires reclaim then it goes and does it.
   However, now we're saying kswapd is going to work in parallel while
   we try to get pages in the worker thread.

   Given you're more versed in the reclaim side, I presume it must be
   pretty bad if we're failing to get order-0 pages even if we have
   NOFS/NOIO set?

   My feeling is that we should add back some knowledge of the
   dependency so if the worker fails to get pages, it doesn't reschedule
   immediately. Maybe it's as simple as adding a sleep in the worker or
   playing with delayed work...
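
To make the delayed work idea a bit more concrete, here is a rough,
untested sketch of the backoff policy, written as plain userspace C so
it can be compiled on its own. Every name and constant in it
(refill_backoff, refill_next_delay_ms, the 10 ms / 1000 ms bounds) is
made up for illustration and is not taken from mm/percpu.c. In the
kernel the returned value would presumably be converted with
msecs_to_jiffies() and handed to schedule_delayed_work() instead of
rescheduling the worker immediately, but that wiring is an assumption
and is not shown here.

```c
/* Userspace sketch only: hypothetical names, not code from mm/percpu.c. */
#include <stdbool.h>

#define REFILL_DELAY_MIN_MS   10u	/* assumed first backoff step */
#define REFILL_DELAY_MAX_MS 1000u	/* assumed upper bound on the delay */

struct refill_backoff {
	unsigned int delay_ms;	/* 0 means "reschedule immediately" */
};

/*
 * Return how long the worker should wait before its next attempt.
 *
 * got_pages == true:  the last attempt populated pages, so drop any
 *                     backoff and allow an immediate reschedule.
 * got_pages == false: the last attempt failed; start at the minimum
 *                     delay, then double it on each further failure
 *                     until the cap, giving kswapd time to make
 *                     progress instead of spinning in the worker.
 */
static unsigned int refill_next_delay_ms(struct refill_backoff *b,
					 bool got_pages)
{
	if (got_pages) {
		b->delay_ms = 0;
	} else if (b->delay_ms == 0) {
		b->delay_ms = REFILL_DELAY_MIN_MS;
	} else {
		b->delay_ms *= 2;
		if (b->delay_ms > REFILL_DELAY_MAX_MS)
			b->delay_ms = REFILL_DELAY_MAX_MS;
	}
	return b->delay_ms;
}
```

The point is only that a failed attempt stops the worker from being
requeued right away, and a successful one restores the current
behaviour. Whether doubling with a cap is the right shape, or a fixed
sleep is enough, is open.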

Thanks,
Dennis