On Mon, Feb 26, 2024 at 12:17 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Sun, Feb 25, 2024 at 07:42:04PM +0800, Yafang Shao wrote:
> > In our container environment, we've observed that certain containers may
> > accumulate more than 40GB of slabs, predominantly negative dentries. These
> > negative dentries remain unreclaimed unless there is memory pressure. Even
> > after the containers exit, these negative dentries persist. To manage disk
> > storage efficiently, we employ an agent that identifies container images
> > eligible for destruction once all instances of that image exit.
>
> I understand why you've written this patch, but we really do need to fix
> this for non-container workloads.  See also:
>
> https://lore.kernel.org/all/20220402072103.5140-1-hdanton@xxxxxxxx/
>
> https://lore.kernel.org/linux-fsdevel/1611235185-1685-1-git-send-email-gautham.ananthakrishna@xxxxxxxxxx/
>
> https://lore.kernel.org/all/YjDvRPuxPN0GsxLB@xxxxxxxxxxxxxxxxxxxx/
>
> I'm sure there have been many other threads on this over the years.

Thank you for sharing your insights. I've reviewed the proposals and
related discussions, and it appears that no consensus has been reached
yet on how to tackle this issue. While I may not fully grasp every
aspect of those discussions, the challenges around slab shrinking seem
to distill into four key questions:

- When should the shrinker be triggered?
- Which task is responsible for performing the shrinking?
- Which slab should be reclaimed?
- How many slabs should be reclaimed?

Answering all of these questions within the kernel might introduce
unnecessary complexity. Instead, one potential approach is to extend
the functionality of memory.reclaim, or to introduce a new interface
such as memory.shrinker, and delegate the decision-making to userspace,
which knows the workload. Since memory.reclaim is also supported in the
root memcg, this can address the issue outside of container
environments as well.
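For reference, a userspace agent can already drive proactive reclaim
through the existing byte-based interface along these lines (the cgroup
path below is illustrative, not a real one):

```
# Ask the kernel to try to reclaim up to 1G from this memcg; how much
# is actually reclaimed depends on what is reclaimable at the time.
echo "1G" > /sys/fs/cgroup/<container>/memory.reclaim
```

The limitation for our use case is that this only expresses an amount
in bytes, with no way to target a particular shrinker.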
Here's a rough idea, which needs validation:

1. Expose detailed shrinker information via debugfs

   We already expose details of the slab caches through
   /sys/kernel/debug/slab, so extending this to include shrinker
   details shouldn't be too challenging. For example, for the dentry
   shrinker we could expose
   /sys/kernel/debug/shrinker/super_cache_scan/{shrinker_id, kmem_cache, ...}.

2. Shrink specific slabs by a specific count

   This could be implemented by extending memory.reclaim with
   parameters like "shrinker_id=" and "scan_count=". Currently,
   memory.reclaim is byte-based, which isn't ideal for shrinkers due to
   the deferred freeing of slabs. Using scan_count to specify the
   number of objects to scan could be more effective.

These are preliminary ideas, and I welcome any feedback. Additionally,
since this patch offers a straightforward solution to the issue in
container environments, would it be feasible to apply it first?

--
Regards
Yafang