Hi Alexey,
Thank you for the patch! It looks good, and we will try it to cut down I/O operations during our high-memory-pressure test.
After checking our vmcore, we can see that the system's I/O pressure comes from swap_writepage and swap_readpage, both called under the shrink list operations.
On Tue, Apr 6, 2021 at 5:59 AM Alexey Avramov <hakavlad@xxxxxxxx> wrote:
> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule
> For the first time involving the swap part, there is no good way to fix
> the problem
The solution is protecting the clean file pages.
Look at this:
> On ChromiumOS, we do not use swap. When memory is low, the only
> way to free memory is to reclaim pages from the file list. This
> results in a lot of thrashing under low memory conditions. We see
> the system become unresponsive for minutes before it eventually OOMs.
> We also see very slow browser tab switching under low memory. Instead
> of an unresponsive system, we'd really like the kernel to OOM as soon
> as it starts to thrash. If it can't keep the working set in memory,
> then OOM. Losing one of many tabs is a better behaviour for the user
> than an unresponsive system.
> This patch creates a new sysctl, min_filelist_kbytes, which disables
> reclaim of file-backed pages when there are less than min_filelist_kbytes
> worth of such pages in the cache. This tunable is handy for low memory
> systems using solid-state storage where interactive response is more important
> than not OOMing.
> With this patch and min_filelist_kbytes set to 50000, I see very little block
> layer activity during low memory. The system stays responsive under low
> memory and browser tab switching is fast. Eventually, a process gets killed
> by OOM. Without this patch, the system gets wedged for minutes before it
> eventually OOMs.
— https://lore.kernel.org/patchwork/patch/222042/
This patch can almost completely eliminate thrashing under memory pressure.
Effects
- Improving system responsiveness under low-memory conditions;
- Improving performance in I/O bound tasks under memory pressure;
- OOM killer comes faster (with hard protection);
- Fast system reclaiming after OOM.
Read more: https://github.com/hakavlad/le9-patch
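To make the two levels of protection easier to follow before reading the diff, here is a rough userspace sketch of the decision the patch makes. It is only an illustration, not the kernel code: struct node_counters, struct clean_flags and all four function names are hypothetical, the counters stand in for the node_page_state() values, and the real get_scan_count() applies other checks (no swap space, priority 0, and so on) before it reaches this point.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-node page counters (in pages). */
struct node_counters {
	unsigned long active_file;
	unsigned long inactive_file;
	unsigned long isolated_file;
	unsigned long file_dirty;
};

/* Hypothetical mirror of the two flags added to struct scan_control. */
struct clean_flags {
	bool below_low;
	bool below_min;
};

/* Clean file cache in KiB; page_shift is 12 for 4 KiB pages. */
static unsigned long clean_file_kbytes(const struct node_counters *n,
				       unsigned int page_shift)
{
	unsigned long file;

	assert(page_shift >= 10);
	file = n->active_file + n->inactive_file + n->isolated_file;
	if (file <= n->file_dirty)
		return 0;	/* nothing clean: treat as zero */
	return (file - n->file_dirty) << (page_shift - 10);
}

/* Compare the clean volume with both thresholds (both 0 = disabled). */
static struct clean_flags classify(const struct node_counters *n,
				   unsigned int page_shift,
				   unsigned long low_kb, unsigned long min_kb)
{
	struct clean_flags f = { false, false };
	unsigned long clean;

	if (low_kb == 0 && min_kb == 0)
		return f;
	clean = clean_file_kbytes(n, page_shift);
	f.below_low = clean < low_kb;
	f.below_min = clean < min_kb;
	return f;
}

/* Either threshold pushes reclaim toward anon, if swappiness allows it. */
static bool force_scan_anon(struct clean_flags f, int swappiness)
{
	return (f.below_low || f.below_min) && swappiness != 0;
}

/* Only the hard threshold zeroes the file scan count outright. */
static bool skip_file_scan(struct clean_flags f)
{
	return f.below_min;
}
```

With 25000 clean 4 KiB pages (100000 KiB), low = 150000 and min = 50000, the sketch reports below_low but not below_min: reclaim is steered toward anonymous pages when swappiness is nonzero, while file pages can still be scanned on other paths. Only dropping below min switches file scanning off.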
The patch:
From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov <hakavlad@xxxxxxxx>
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
amount of clean file cache
The kernel does not have a mechanism for targeted protection of clean
file pages (CFP). A certain amount of the CFP is required by the userspace
for normal operation. First of all, you need a cache of shared libraries
and executable files. If the volume of the CFP cache falls below a certain
level, thrashing and even livelock occurs.
Protection of CFP may be used to prevent thrashing and reduce I/O under
memory pressure. Hard protection of CFP may be used to avoid high latency
and prevent livelock in near-OOM conditions. The patch provides sysctl
knobs for protecting the specified amount of clean file cache under memory
pressure.
The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
CFP. The CFP on the current node won't be reclaimed under memory pressure
when their volume is below vm.clean_low_kbytes *unless* we threaten to OOM
or have no swap space or vm.swappiness=0. Setting it to a high value may
result in an early eviction of anonymous pages into the swap space by
attempting to hold the protected amount of clean file pages in memory. The
default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
Kconfig).
The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_min_kbytes. Setting it to a high value may result
in an early out-of-memory condition due to the inability to reclaim the
protected amount of CFP when other types of pages cannot be reclaimed. The
default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
Kconfig).
Reported-by: Artem S. Tashkinov <aros@xxxxxxx>
Signed-off-by: Alexey Avramov <hakavlad@xxxxxxxx>
---
Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++++++++++
include/linux/mm.h | 3 ++
kernel/sysctl.c | 14 ++++++++
mm/Kconfig | 35 +++++++++++++++++++
mm/vmscan.c | 59 +++++++++++++++++++++++++++++++++
5 files changed, 148 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f455fa00c..5d5ddfc85 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:
- admin_reserve_kbytes
- block_dump
+- clean_low_kbytes
+- clean_min_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
+clean_low_kbytes
+================
+
+This knob provides *best-effort* protection of clean file pages. The clean file
+pages on the current node won't be reclaimed under memory pressure when their
+volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
+swap space or vm.swappiness=0.
+
+Protection of clean file pages may be used to prevent thrashing and
+reduce I/O under low-memory conditions.
+
+Setting it to a high value may result in an early eviction of anonymous pages
+into the swap space by attempting to hold the protected amount of clean file
+pages in memory.
+
+The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
+
+
+clean_min_kbytes
+================
+
+This knob provides *hard* protection of clean file pages. The clean file pages
+on the current node won't be reclaimed under memory pressure when their volume
+is below vm.clean_min_kbytes.
+
+Hard protection of clean file pages may be used to avoid high latency and
+prevent livelock in near-OOM conditions.
+
+Setting it to a high value may result in an early out-of-memory condition due to
+the inability to reclaim the protected amount of clean file pages when other
+types of pages cannot be reclaimed.
+
+The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
+
+
compact_memory
==============
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3f..7799f1555 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -202,6 +202,9 @@ static inline void __mm_zero_struct_page(struct page *page)
extern int sysctl_max_map_count;
+extern unsigned long sysctl_clean_low_kbytes;
+extern unsigned long sysctl_clean_min_kbytes;
+
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afad08596..854b311cd 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3083,6 +3083,20 @@ static struct ctl_table vm_table[] = {
},
#endif
{
+ .procname = "clean_low_kbytes",
+ .data = &sysctl_clean_low_kbytes,
+ .maxlen = sizeof(sysctl_clean_low_kbytes),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+ {
+ .procname = "clean_min_kbytes",
+ .data = &sysctl_clean_min_kbytes,
+ .maxlen = sizeof(sysctl_clean_min_kbytes),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ },
+ {
.procname = "user_reserve_kbytes",
.data = &sysctl_user_reserve_kbytes,
.maxlen = sizeof(sysctl_user_reserve_kbytes),
diff --git a/mm/Kconfig b/mm/Kconfig
index 390165ffb..3915c71e1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.
+config CLEAN_LOW_KBYTES
+ int "Default value for vm.clean_low_kbytes"
+ depends on SYSCTL
+ default "0"
+ help
+ The vm.clean_low_kbytes sysctl knob provides *best-effort*
+ protection of clean file pages. The clean file pages on the current
+ node won't be reclaimed under memory pressure when their volume is
+ below vm.clean_low_kbytes *unless* we threaten to OOM or have
+ no swap space or vm.swappiness=0.
+
+ Protection of clean file pages may be used to prevent thrashing and
+ reduce I/O under low-memory conditions.
+
+ Setting it to a high value may result in an early eviction of anonymous
+ pages into the swap space by attempting to hold the protected amount of
+ clean file pages in memory.
+
+config CLEAN_MIN_KBYTES
+ int "Default value for vm.clean_min_kbytes"
+ depends on SYSCTL
+ default "0"
+ help
+ The vm.clean_min_kbytes sysctl knob provides *hard* protection
+ of clean file pages. The clean file pages on the current node won't be
+ reclaimed under memory pressure when their volume is below
+ vm.clean_min_kbytes.
+
+ Hard protection of clean file pages may be used to avoid high latency and
+ prevent livelock in near-OOM conditions.
+
+ Setting it to a high value may result in an early out-of-memory condition
+ due to the inability to reclaim the protected amount of clean file pages
+ when other types of pages cannot be reclaimed.
+
config HAVE_MEMBLOCK_PHYS_MAP
bool
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b4e31eac..77e98c43e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -120,6 +120,19 @@ struct scan_control {
/* The file pages on the current node are dangerously low */
unsigned int file_is_tiny:1;
+ /*
+ * The clean file pages on the current node won't be reclaimed when
+ * their volume is below vm.clean_low_kbytes *unless* we threaten
+ * to OOM or have no swap space or vm.swappiness=0.
+ */
+ unsigned int clean_below_low:1;
+
+ /*
+ * The clean file pages on the current node won't be reclaimed when
+ * their volume is below vm.clean_min_kbytes.
+ */
+ unsigned int clean_below_min:1;
+
/* Allocation order */
s8 order;
@@ -166,6 +179,17 @@ struct scan_control {
#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
#endif
+#if CONFIG_CLEAN_LOW_KBYTES < 0
+#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
+#endif
+
+#if CONFIG_CLEAN_MIN_KBYTES < 0
+#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
+#endif
+
+unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
+unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
+
/*
* From 0 .. 200. Higher means more swappy.
*/
@@ -2283,6 +2307,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
}
/*
+ * Force-scan anon if clean file pages are under vm.clean_min_kbytes
+ * or vm.clean_low_kbytes (unless the swappiness setting
+ * disagrees with swapping).
+ */
+ if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
+ scan_balance = SCAN_ANON;
+ goto out;
+ }
+
+ /*
* If there is enough inactive page cache, we do not reclaim
* anything from the anonymous working right now.
*/
@@ -2418,6 +2452,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
BUG();
}
+ /*
+ * Don't reclaim clean file pages when their volume is below
+ * vm.clean_min_kbytes.
+ */
+ if (file && sc->clean_below_min)
+ scan = 0;
+
nr[lru] = scan;
}
}
@@ -2768,6 +2809,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
anon >> sc->priority;
}
+ if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
+ unsigned long reclaimable_file, dirty, clean = 0;
+
+ reclaimable_file =
+ node_page_state(pgdat, NR_ACTIVE_FILE) +
+ node_page_state(pgdat, NR_INACTIVE_FILE) +
+ node_page_state(pgdat, NR_ISOLATED_FILE);
+ dirty = node_page_state(pgdat, NR_FILE_DIRTY);
+ if (reclaimable_file > dirty)
+ clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
+
+ sc->clean_below_low = clean < sysctl_clean_low_kbytes;
+ sc->clean_below_min = clean < sysctl_clean_min_kbytes;
+ } else {
+ sc->clean_below_low = false;
+ sc->clean_below_min = false;
+ }
+
shrink_node_memcgs(pgdat, sc);
if (reclaim_state) {
--
2.11.0