On Mon, Feb 26, 2024 at 12:17 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Sun, Feb 25, 2024 at 07:42:04PM +0800, Yafang Shao wrote:
> > In our container environment, we've observed that certain containers may
> > accumulate more than 40GB of slabs, predominantly negative dentries. These
> > negative dentries remain unreclaimed unless there is memory pressure. Even
> > after the containers exit, these negative dentries persist. To manage disk
> > storage efficiently, we employ an agent that identifies container images
> > eligible for destruction once all instances of that image exit.
>
> I understand why you've written this patch, but we really do need to fix
> this for non-container workloads.  See also:
>
> https://lore.kernel.org/all/20220402072103.5140-1-hdanton@xxxxxxxx/
>
> https://lore.kernel.org/linux-fsdevel/1611235185-1685-1-git-send-email-gautham.ananthakrishna@xxxxxxxxxx/
>
> https://lore.kernel.org/all/YjDvRPuxPN0GsxLB@xxxxxxxxxxxxxxxxxxxx/
>
> I'm sure there have been many other threads on this over the years.

Thank you for sharing your insights. I've reviewed the proposals and
related discussions, and it appears that no consensus has been reached
yet on how to tackle this issue. While I may not fully grasp every
aspect of those discussions, the challenges around slab shrinking seem
to distill into four key questions:

- When should the shrinker be triggered?
- Which task is responsible for performing the shrinking?
- Which slab should be reclaimed?
- How many slabs should be reclaimed?

Answering all of these questions within the kernel might introduce
unnecessary complexity. Instead, one potential approach is to extend
the functionality of memory.reclaim, or to introduce a new interface
such as memory.shrinker, and delegate the decision-making to userspace,
which knows the workload. Since memory.reclaim is also supported in the
root memcg, this can address the issue outside of container
environments as well.
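For reference, a userspace agent can already drive proactive reclaim
through the existing byte-based interface along these lines (the cgroup
path below is illustrative, not a real one):

```
# Ask the kernel to try to reclaim up to 1G from this memcg; how much
# is actually reclaimed depends on what is reclaimable at the time.
echo "1G" > /sys/fs/cgroup/<container>/memory.reclaim
```

The limitation for our use case is that this only expresses an amount
in bytes, with no way to target a particular shrinker.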
Here's a rough idea, which needs validation:

1. Expose detailed shrinker information via debugfs

   We already expose details of the slab caches through
   /sys/kernel/debug/slab, so extending this to include shrinker
   details shouldn't be too challenging. For example, for the dentry
   shrinker we could expose
   /sys/kernel/debug/shrinker/super_cache_scan/{shrinker_id, kmem_cache, ...}.

2. Shrink specific slabs by a specific count

   This could be implemented by extending memory.reclaim with
   parameters like "shrinker_id=" and "scan_count=". Currently,
   memory.reclaim is byte-based, which isn't ideal for shrinkers due to
   the deferred freeing of slabs. Using scan_count to specify the
   number of objects to scan could be more effective.

These are preliminary ideas, and I welcome any feedback. Additionally,
since this patch offers a straightforward solution to the issue in
container environments, would it be feasible to apply it first?

--
Regards
Yafang