The patch titled
     Subject: mm: zero-seek shrinkers
has been added to the -mm tree.  Its filename is
     mm-zero-seek-shrinkers.patch

This patch should soon appear at
     http://ozlabs.org/~akpm/mmots/broken-out/mm-zero-seek-shrinkers.patch
and later at
     http://ozlabs.org/~akpm/mmotm/broken-out/mm-zero-seek-shrinkers.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when
    testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Subject: mm: zero-seek shrinkers

The page cache and most shrinkable slab caches hold data that has been
read from disk, but there are some caches that only cache CPU work, such
as the dentry and inode caches of procfs and sysfs, as well as the subset
of radix tree nodes that track non-resident page cache.

Currently, all these are shrunk at the same rate: using DEFAULT_SEEKS for
the shrinker's seeks setting tells the reclaim algorithm that for every
two page cache pages scanned it should scan one slab object.

This is a bogus setting.  A virtual inode that required no IO to create is
not twice as valuable as a page cache page; shadow cache entries with
eviction distances beyond the size of memory aren't either.

In most cases, the behavior in practice is still fine.  Such virtual
caches don't tend to grow and assert themselves aggressively, and usually
get picked up before they cause problems.  But there are scenarios where
that's not true.

Our database workloads suffer from two of those.
For one, their file workingset is several times bigger than available
memory, which has the kernel aggressively create shadow page cache entries
for the non-resident parts of it.  The workingset code does tell the VM
that most of these are expendable, but the VM ends up balancing them 2:1
to cache pages as per the seeks setting.  This is a huge waste of memory.

These workloads also deal with tens of thousands of open files and use
/proc for introspection, which ends up growing the proc_inode_cache to
absurdly large sizes - again at the cost of valuable cache space, which
isn't a reasonable trade-off, given that proc inodes can be re-created
without involving the disk.

This patch implements a "zero-seek" setting for shrinkers that results in
a target ratio of 0:1 between their objects and IO-backed caches.  This
allows such virtual caches to grow when memory is available (they do
cache/avoid CPU work after all), but effectively disables them as soon as
IO-backed objects are under pressure.

It then switches the shrinkers for procfs and sysfs metadata, as well as
excess page cache shadow nodes, to the new zero-seek setting.
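The effect on the scan target can be modeled outside the kernel.  Below is
a minimal Python sketch of the delta computation that the patch changes in
do_shrink_slab(); the constant values mirror the kernel (DEFAULT_SEEKS is
2, default reclaim priority is 12), but the harness and the example
numbers are purely illustrative:

```python
DEFAULT_SEEKS = 2   # include/linux/shrinker.h
DEF_PRIORITY = 12   # default reclaim priority

def shrink_delta(freeable, priority, seeks):
    """Sketch of the scan-target math in do_shrink_slab().

    With a nonzero seeks value, one slab object is scanned per
    seeks/2 page cache pages.  With seeks == 0 (this patch), half
    of the freeable objects are targeted regardless of priority.
    """
    if seeks:
        return (freeable >> priority) * 4 // seeks
    # zero-seek: objects cost no IO to recreate, trim aggressively
    return freeable // 2

# With 1M freeable objects at default priority:
print(shrink_delta(1 << 20, DEF_PRIORITY, DEFAULT_SEEKS))  # 512
print(shrink_delta(1 << 20, DEF_PRIORITY, 0))              # 524288
```

At DEFAULT_SEEKS the shrinker is asked to scan a small slice that shrinks
further as priority rises; at seeks == 0 it is asked for half of
everything freeable as soon as there is any reclaim pressure at all.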
Link: http://lkml.kernel.org/r/20181009184732.762-5-hannes@xxxxxxxxxxx
Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
Reported-by: Domas Mituzas <dmituzas@xxxxxx>
Reviewed-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

diff -puN fs/kernfs/mount.c~mm-zero-seek-shrinkers fs/kernfs/mount.c
--- a/fs/kernfs/mount.c~mm-zero-seek-shrinkers
+++ a/fs/kernfs/mount.c
@@ -236,6 +236,9 @@ static int kernfs_fill_super(struct supe
 	sb->s_export_op = &kernfs_export_ops;
 	sb->s_time_gran = 1;

+	/* sysfs dentries and inodes don't require IO to create */
+	sb->s_shrink.seeks = 0;
+
 	/* get root inode, initialize and unlock it */
 	mutex_lock(&kernfs_mutex);
 	inode = kernfs_get_inode(sb, info->root->kn);
diff -puN fs/proc/root.c~mm-zero-seek-shrinkers fs/proc/root.c
diff -puN mm/vmscan.c~mm-zero-seek-shrinkers mm/vmscan.c
--- a/mm/vmscan.c~mm-zero-seek-shrinkers
+++ a/mm/vmscan.c
@@ -474,9 +474,18 @@ static unsigned long do_shrink_slab(stru
 	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
 	total_scan = nr;

-	delta = freeable >> priority;
-	delta *= 4;
-	do_div(delta, shrinker->seeks);
+	if (shrinker->seeks) {
+		delta = freeable >> priority;
+		delta *= 4;
+		do_div(delta, shrinker->seeks);
+	} else {
+		/*
+		 * These objects don't require any IO to create. Trim
+		 * them aggressively under memory pressure to keep
+		 * them from causing refetches in the IO caches.
+		 */
+		delta = freeable / 2;
+	}

 	/*
 	 * Make sure we apply some minimal pressure on default priority
diff -puN mm/workingset.c~mm-zero-seek-shrinkers mm/workingset.c
--- a/mm/workingset.c~mm-zero-seek-shrinkers
+++ a/mm/workingset.c
@@ -534,7 +534,7 @@ static unsigned long scan_shadow_nodes(s
 static struct shrinker workingset_shadow_shrinker = {
 	.count_objects = count_shadow_nodes,
 	.scan_objects = scan_shadow_nodes,
-	.seeks = DEFAULT_SEEKS,
+	.seeks = 0, /* ->count reports only fully expendable nodes */
 	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
 };
diff -puN fs/proc/inode.c~mm-zero-seek-shrinkers fs/proc/inode.c
--- a/fs/proc/inode.c~mm-zero-seek-shrinkers
+++ a/fs/proc/inode.c
@@ -516,6 +516,9 @@ int proc_fill_super(struct super_block *
 	 */
 	s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;

+	/* procfs dentries and inodes don't require IO to create */
+	s->s_shrink.seeks = 0;
+
 	pde_get(&proc_root);
 	root_inode = proc_get_inode(s, &proc_root);
 	if (!root_inode) {
_

Patches currently in -mm which might be from hannes@xxxxxxxxxxx are

mm-workingset-dont-drop-refault-information-prematurely-fix.patch
mm-workingset-tell-cache-transitions-from-workingset-thrashing.patch
delayacct-track-delays-from-thrashing-cache-pages.patch
sched-loadavg-consolidate-load_int-load_frac-calc_load.patch
sched-loadavg-consolidate-load_int-load_frac-calc_load-fix-fix.patch
sched-loadavg-make-calc_load_n-public.patch
sched-schedh-make-rq-locking-and-clock-functions-available-in-statsh.patch
sched-introduce-this_rq_lock_irq.patch
psi-pressure-stall-information-for-cpu-memory-and-io.patch
psi-pressure-stall-information-for-cpu-memory-and-io-fix.patch
psi-pressure-stall-information-for-cpu-memory-and-io-fix-2.patch
psi-pressure-stall-information-for-cpu-memory-and-io-fix-3.patch
psi-pressure-stall-information-for-cpu-memory-and-io-fix-4.patch
psi-cgroup-support.patch
mm-workingset-use-cheaper-__inc_lruvec_state-in-irqsafe-node-reclaim.patch
mm-workingset-add-vmstat-counter-for-shadow-nodes.patch
mm-zero-seek-shrinkers.patch