Re: [RFC 00/12] module: avoid userspace pressure on unwanted allocations

Luis Chamberlain <mcgrof@xxxxxxxxxx> · Fri, 24 Mar 2023 10:54:39 -0700

On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
> On 21.03.23 20:32, David Hildenbrand wrote:
> > On 20.03.23 22:27, Luis Chamberlain wrote:
> > > On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > > > 
> > > > Oh wow, so to clarify, it boots OK?
> > > > 
> > > 
> > > Now that we know that tree works, I'm curious also now if you can
> > > confirm just re-ordering the patches still works (it should)
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> > > 
> > 
> > So far no vmap errors booting the debug/kasan kernel (2 tries).

<-- snip -->

> > I think we primarily only care about systemd-udev-settle.service.
> > 
> > That is fastest without the rcu patch (~6s), compared to with the rcu
> > patch (~6.5s) and with stock (~7.5s -- 8s).
> > 
> > Looks like dracut-initqueue also might be a bit faster with your changes, but
> > maybe it's mostly noise (would have to do more runs).
> > 
> > So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> > helpful ...

Yes I confirm the RCU patch does not help at all now also using
stress-ng.

> Are there any other things you would like me to measure/test? I'll have to
> hand back that test machine soonish.

Yes please test the below. Perhaps its not the final form we want, but
it *does* fix OOM'ing when thrashing with stress-ng now with the module
option and even with 100 threads brings down max memory consumption by
259 MiB. The reason is that we also vmalloc during each finit_read_file()
for each module as well way before we even do layout_and_allocate(), and
so obviously if we fix the module path but not this path this will eventually
catch up with us as. I'm not at all happy with the current approach given
ideally we'd bump the counter when the user is done with the file, but we
don't yet have any tracking of that for users, they just vfree the memory
itself. And so this is just trying to catch heavy immediate abuse on the
caller to fend off abuse of vmalloc uses in a lightway manner.

There's gotta be a better way to do this, but its just an idea I have so far.
If we *want* to keep tabs until the user is done, we have to just modify
most users of these APIs and intrudce our own free. I don't think we're
in a rush to fix this so maybe that's the better approach.

And so I've managed to reproduce the issues you found now with my new stress-ng
module stressor as well.

https://github.com/ColinIanKing/stress-ng.git

Even though you have 400 CPUs with stress-ng we can likely reproduce it
with (use a module not loaded on your system):

./stress-ng --module 100 --module-name xfs

Without the patch below using 400 threads still OOMs easily due to the
kread issue. Max threads allowed are 8192.

>From ec5099b15bb74f154319034547184e81e4ad8ba0 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@xxxxxxxxxx>
Date: Fri, 24 Mar 2023 10:35:41 -0700
Subject: [PATCH] fs/kernel_read_file: add a clutch

Signed-off-by: Luis Chamberlain <mcgrof@xxxxxxxxxx>
---
 fs/kernel_read_file.c | 52 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..2d55abe73b21 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,10 +1,52 @@
 // SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "kread: " fmt
+
 #include <linux/fs.h>
 #include <linux/fs_struct.h>
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
 
+/*
+ * This clutch ensures we only allow a certain number concurrent threads at a
+ * time allocating space concurrently and they must all finish within the
+ * timeout specified.  Anything more we know we're thrashing.
+ */
+#define MAX_KREAD_CONCURRENT 20
+static atomic_t kread_concurrent_max = ATOMIC_INIT(MAX_KREAD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(kread_wq);
+
+/*
+ * How many seconds to wait for *all*  MAX_KREAD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_KREAD_ALL_BUSY_TIMEOUT 1
+
+static int kernel_read_check_concurrent(void)
+{
+	int ret;
+
+	if (atomic_dec_if_positive(&kread_concurrent_max) < 0) {
+		pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&kread_concurrent_max),
+				    MAX_KREAD_CONCURRENT);
+		ret = wait_event_killable_timeout(kread_wq,
+						  atomic_dec_if_positive(&kread_concurrent_max) >= 0,
+						  MAX_KREAD_ALL_BUSY_TIMEOUT * HZ);
+		if (!ret) {
+			pr_warn_ratelimited("reading cannot be processed, kernel busy with %d threads reading files now for more than %d seconds",
+					    MAX_KREAD_CONCURRENT, MAX_KREAD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (ret == -ERESTARTSYS) {
+			pr_warn_ratelimited("sigkill sent for kernel read, giving up");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * kernel_read_file() - read file contents into a kernel buffer
  *
@@ -68,10 +110,14 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 		goto out;
 	}
 
+	ret = kernel_read_check_concurrent();
+	if (ret)
+		goto out;
+
 	whole_file = (offset == 0 && i_size <= buf_size);
 	ret = security_kernel_read_file(file, id, whole_file);
 	if (ret)
-		goto out;
+		goto out_allow_new_read;
 
 	if (file_size)
 		*file_size = i_size;
@@ -117,7 +163,9 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 			*buf = NULL;
 		}
 	}
-
+out_allow_new_read:
+	atomic_inc(&kread_concurrent_max);
+	wake_up(&kread_wq);
 out:
 	allow_write_access(file);
 	return ret == 0 ? copied : ret;
-- 
2.39.2