Re: [PATCH RFC v2 03/29] mm: asi: Introduce ASI core API

Borislav Petkov <bp@xxxxxxxxx> · Wed, 19 Feb 2025 11:55:03 +0100

On Fri, Jan 10, 2025 at 06:40:29PM +0000, Brendan Jackman wrote:
Subject: Re: [PATCH RFC v2 03/29] mm: asi: Introduce ASI core API

x86/asi: ...

Introduce core API for Address Space Isolation (ASI).  Kernel address
space isolation provides the ability to run some kernel
code with a restricted kernel address space.

There can be multiple classes of such restricted kernel address spaces
(e.g. KPTI, KVM-PTI etc.). Each ASI class is identified by an index.
The ASI class can register some hooks to be called when
entering/exiting the restricted address space.

Currently, there is a fixed maximum number of ASI classes supported.
In addition, each process can have at most one restricted address space
from each ASI class. Neither of these are inherent limitations and
are merely simplifying assumptions for the time being.

To keep things simpler for the time being, we disallow context switches

Please use passive voice in your commit message: no "we" or "I", etc.,
and describe your changes in imperative mood.

Also, see section "Changelog" in
Documentation/process/maintainer-tip.rst

within the restricted address space. In the future, we would be able to
relax this limitation for the case of context switches to different
threads within the same process (or to the idle thread and back).

Note that this doesn't really support protecting sibling VM guests
within the same VMM process from one another. From first principles
it seems unlikely that anyone who cares about VM isolation would do
that, but there could be a use-case to think about. In that case
something like the OTHER_MM logic might be needed, but specific to
intra-process guest separation.

[0]:
https://lore.kernel.org/kvm/1562855138-19507-1-git-send-email-alexandre.chartre@xxxxxxxxxx

Notes about RFC-quality implementation details:

 - Ignoring checkpatch.pl AVOID_BUG.
 - The dynamic registration of classes might be pointless complexity.
   This was kept from RFCv1 without much thought.
 - The other-mm logic is also perhaps overly complex, suggestions are
   welcome for how best to tackle this (or we could just forget about
   it for the moment, and rely on asi_exit() happening in process
   switch).
 - The taint flag definitions would probably be clearer with an enum or
   something.

Checkpatch-args: --ignore=AVOID_BUG,COMMIT_LOG_LONG_LINE,EXPORT_SYMBOL
Co-developed-by: Ofir Weisse <oweisse@xxxxxxxxxx>
Signed-off-by: Ofir Weisse <oweisse@xxxxxxxxxx>
Co-developed-by: Junaid Shahid <junaids@xxxxxxxxxx>
Signed-off-by: Junaid Shahid <junaids@xxxxxxxxxx>
Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>
---
 arch/x86/include/asm/asi.h       | 208 +++++++++++++++++++++++
 arch/x86/include/asm/processor.h |   8 +
 arch/x86/mm/Makefile             |   1 +
 arch/x86/mm/asi.c                | 350 +++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/init.c               |   3 +-
 arch/x86/mm/tlb.c                |   1 +
 include/asm-generic/asi.h        |  67 ++++++++
 include/linux/mm_types.h         |   7 +
 kernel/fork.c                    |   3 +
 kernel/sched/core.c              |   9 +
 mm/init-mm.c                     |   4 +
 11 files changed, 660 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h
new file mode 100644
index 0000000000000000000000000000000000000000..7cc635b6653a3970ba9dbfdc9c828a470e27bd44
--- /dev/null
+++ b/arch/x86/include/asm/asi.h
@@ -0,0 +1,208 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_ASI_H
+#define _ASM_X86_ASI_H
+
+#include <linux/sched.h>
+
+#include <asm-generic/asi.h>
+
+#include <asm/pgtable_types.h>
+#include <asm/percpu.h>
+#include <asm/processor.h>
+
+#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
+
+/*
+ * Overview of API usage by ASI clients:
+ *
+ * Setup: First call asi_init() to create a domain. At present only one domain
+ * can be created per mm per class, but it's safe to asi_init() this domain
+ * multiple times. For each asi_init() call you must call asi_destroy() AFTER
+ * you are certain all CPUs have exited the restricted address space (by
+ * calling asi_exit()).
+ *
+ * Runtime usage:
+ *
+ * 1. Call asi_enter() to switch to the restricted address space. This can't be
+ *    from an interrupt or exception handler and preemption must be disabled.
+ *
+ * 2. Execute untrusted code.
+ *
+ * 3. Call asi_relax() to inform the ASI subsystem that untrusted code execution
+ *    is finished. This doesn't cause any address space change. This can't be
+ *    from an interrupt or exception handler and preemption must be disabled.
+ *
+ * 4. Either:
+ *
+ *    a. Go back to 1.
+ *
+ *    b. Call asi_exit() before returning to userspace. This immediately
+ *       switches to the unrestricted address space.

So only from reading this, it does sound weird. Maybe the code does it
differently - I'll see soon - but this basically says:

I asi_enter(), do something, asi_relax() and then I decide to do something
more and to asi_enter() again!? And then I can end it all with a *single*
asi_exit() call?

Hm, definitely weird API. Why?

/*
 * Leave the "tense" state if we are in it, i.e. end the critical section. We
 * will stay relaxed until the next asi_enter.
 */
void asi_relax(void);

Yeah, so there's no balance between the enter() and relax() API calls...
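
To make sure I'm reading the comment right, here is a minimal userspace sketch
of the flow as I understand it. Everything in it is a stand-in: the two globals
model the per-CPU curr_asi and the per-task target, and nr_switches and
run_rounds() are made up for illustration. It assumes asi_relax() only drops
the target and that the actual switch back happens in asi_exit(); asi.c may
well do it differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model, NOT the kernel implementation. */
struct asi { int id; };

static struct asi *curr_asi;	/* models the per-CPU curr_asi */
static struct asi *target_asi;	/* models p->thread.asi_state.target */
static int nr_switches;		/* address space switches, illustration only */

static bool asi_is_restricted(void) { return curr_asi != NULL; }
static bool asi_is_relaxed(void)    { return target_asi == NULL; }

/* Begin the critical section; switch only if not already in the domain. */
static void asi_enter(struct asi *asi)
{
	target_asi = asi;
	if (curr_asi != asi) {
		curr_asi = asi;
		nr_switches++;
	}
}

/* End the critical section; assumed to cause no address space change. */
static void asi_relax(void)
{
	target_asi = NULL;
}

/* Leave the restricted address space immediately, if in it. */
static void asi_exit(void)
{
	if (curr_asi) {
		curr_asi = NULL;
		nr_switches++;
	}
}

/* Run n enter/relax rounds, then a single exit; return the switch count. */
static int run_rounds(struct asi *asi, int n)
{
	nr_switches = 0;
	for (int i = 0; i < n; i++) {
		asi_enter(asi);
		/* ... untrusted code would run here ... */
		asi_relax();
	}
	asi_exit();
	return nr_switches;
}
```

If that reading is correct, repeated enter()/relax() rounds cost a single switch
in and a single switch out, which would explain the lack of balance. If so,
please say that explicitly in the comment.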

+ *
+ * The region between 1 and 3 is called the "ASI critical section". During the
+ * critical section, it is a bug to access any sensitive data, and you mustn't
+ * sleep.
+ *
+ * The restriction on sleeping is not really a fundamental property of ASI.
+ * However for performance reasons it's important that the critical section is
+ * absolutely as short as possible. So the ability to do sleepy things like
+ * taking mutexes oughtn't to confer any convenience on API users.
+ *
+ * Similarly to the issue of sleeping, the need to asi_exit in case 4b is not a
+ * fundamental property of the system but a limitation of the current
+ * implementation. With further work it is possible to context switch
+ * from and/or to the restricted address space, and to return to userspace
+ * directly from the restricted address space, or _in_ it.
+ *
+ * Note that the critical section only refers to the direct execution path from
+ * asi_enter to asi_relax: it's fine to access sensitive data from exceptions
+ * and interrupt handlers that occur during that time. ASI will re-enter the
+ * restricted address space before returning from the outermost
+ * exception/interrupt.
+ *
+ * Note: ASI does not modify KPTI behaviour; when ASI and KPTI run together
+ * there are 2+N address spaces per task: the unrestricted kernel address space,
+ * the user address space, and one restricted (kernel) address space for each of
+ * the N ASI classes.
+ */
+
+/*
+ * ASI uses a per-CPU tainting model to track what mitigation actions are
+ * required on domain transitions. Taints exist along two dimensions:
+ *
+ *  - Who touched the CPU (guest, unprotected kernel, userspace).
+ *
+ *  - What kind of state might remain: "data" means there might be data owned by
+ *    a victim domain left behind in a sidechannel. "Control" means there might
+ *    be state controlled by an attacker domain left behind in the branch
+ *    predictor.
+ *
+ *    In principle the same domain can be both attacker and victim, thus we have
+ *    both data and control taints for userspace, although there's no point in
+ *    trying to protect against attacks from the kernel itself, so there's no
+ *    ASI_TAINT_KERNEL_CONTROL.
+ */
+#define ASI_TAINT_KERNEL_DATA		((asi_taints_t)BIT(0))
+#define ASI_TAINT_USER_DATA		((asi_taints_t)BIT(1))
+#define ASI_TAINT_GUEST_DATA		((asi_taints_t)BIT(2))
+#define ASI_TAINT_OTHER_MM_DATA		((asi_taints_t)BIT(3))
+#define ASI_TAINT_USER_CONTROL		((asi_taints_t)BIT(4))
+#define ASI_TAINT_GUEST_CONTROL		((asi_taints_t)BIT(5))
+#define ASI_TAINT_OTHER_MM_CONTROL	((asi_taints_t)BIT(6))
+#define ASI_NUM_TAINTS			6
+static_assert(BITS_PER_BYTE * sizeof(asi_taints_t) >= ASI_NUM_TAINTS);

Why is this a typedef at all? It only makes the code more unreadable than it
needs to be. Why not a simple unsigned int or char or whatever you need?
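
Something along these lines, as a rough standalone sketch only. The bit layout
is copied from the hunk above; BIT() is redefined locally so that it builds
outside the kernel, and ASI_TAINTS_ALL and asi_taints_valid() are hypothetical
helpers which are not in the patch. Note that bits 0..6 make seven taints, so
the sketch assumes ASI_NUM_TAINTS should be 7.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Local stand-in for the kernel's BIT() so this builds in userspace. */
#define BIT(n)				(1U << (n))

#define ASI_TAINT_KERNEL_DATA		BIT(0)
#define ASI_TAINT_USER_DATA		BIT(1)
#define ASI_TAINT_GUEST_DATA		BIT(2)
#define ASI_TAINT_OTHER_MM_DATA		BIT(3)
#define ASI_TAINT_USER_CONTROL		BIT(4)
#define ASI_TAINT_GUEST_CONTROL		BIT(5)
#define ASI_TAINT_OTHER_MM_CONTROL	BIT(6)
#define ASI_NUM_TAINTS			7	/* assumption: bits 0..6 are in use */

/* Hypothetical helper: mask of all defined taint bits. */
#define ASI_TAINTS_ALL			(BIT(ASI_NUM_TAINTS) - 1)

static_assert(CHAR_BIT * sizeof(unsigned int) >= ASI_NUM_TAINTS,
	      "taint bits must fit in an unsigned int");

/* Hypothetical helper: true if no bit outside the defined taints is set. */
static bool asi_taints_valid(unsigned int taints)
{
	return (taints & ~ASI_TAINTS_ALL) == 0;
}
```

With a plain integer type the static_assert is still there to catch someone
adding taints beyond the width of the type.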

+
+#define ASI_TAINTS_CONTROL_MASK \
+	(ASI_TAINT_USER_CONTROL | ASI_TAINT_GUEST_CONTROL | ASI_TAINT_OTHER_MM_CONTROL)
+
+#define ASI_TAINTS_DATA_MASK \
+	(ASI_TAINT_KERNEL_DATA | ASI_TAINT_USER_DATA | ASI_TAINT_OTHER_MM_DATA)
+
+#define ASI_TAINTS_GUEST_MASK (ASI_TAINT_GUEST_DATA | ASI_TAINT_GUEST_CONTROL)
+#define ASI_TAINTS_USER_MASK (ASI_TAINT_USER_DATA | ASI_TAINT_USER_CONTROL)
+
+/* The taint policy tells ASI how a class interacts with the CPU taints */
+struct asi_taint_policy {
+	/*
+	 * What taints would necessitate a flush when entering the domain, to
+	 * protect it from attack by prior domains?
+	 */
+	asi_taints_t prevent_control;

So if those necessitate a flush, why isn't this var called "taints_to_flush"
or whatever which exactly explains what it is?

+	/*
+	 * What taints would necessetate a flush when entering the domain, to

+	 * What taints would necessetate a flush when entering the domain, to
Unknown word [necessetate] in comment.
Suggestions: ['necessitate',

Spellchecker please. Go over your whole set.

+	 * protect former domains from attack by this domain?
+	 */
+	asi_taints_t protect_data;

Same.

+	/* What taints should be set when entering the domain? */
+	asi_taints_t set;


So "required_taints" or so... hm?
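
To illustrate what I mean by self-explanatory names, a rough sketch of how I'd
expect such a policy to be consumed. The semantics are guessed purely from the
field comments above, the field and function names are hypothetical, and the
example policy in the usage note is invented rather than taken from any real
ASI class:

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n)			(1U << (n))	/* local stand-in for the kernel macro */

/* Subset of the taints from the patch, enough for the example. */
#define ASI_TAINT_KERNEL_DATA	BIT(0)
#define ASI_TAINT_GUEST_DATA	BIT(2)
#define ASI_TAINT_USER_CONTROL	BIT(4)
#define ASI_TAINT_GUEST_CONTROL	BIT(5)

/* Hypothetical renaming of the three policy fields. */
struct asi_taint_policy {
	unsigned int control_taints_to_flush;	/* was prevent_control */
	unsigned int data_taints_to_flush;	/* was protect_data */
	unsigned int required_taints;		/* was set */
};

/* Assumed: flush if the CPU carries any control taint the policy lists. */
static bool asi_needs_control_flush(const struct asi_taint_policy *p,
				    unsigned int cpu_taints)
{
	return (cpu_taints & p->control_taints_to_flush) != 0;
}

/* Assumed: flush if the CPU carries any data taint the policy lists. */
static bool asi_needs_data_flush(const struct asi_taint_policy *p,
				 unsigned int cpu_taints)
{
	return (cpu_taints & p->data_taints_to_flush) != 0;
}

/* Assumed: entering the domain ORs the required taints into the CPU's set. */
static unsigned int asi_taints_after_enter(const struct asi_taint_policy *p,
					   unsigned int cpu_taints)
{
	return cpu_taints | p->required_taints;
}
```

With an invented policy that lists ASI_TAINT_USER_CONTROL as a control taint to
flush, a CPU tainted only with ASI_TAINT_KERNEL_DATA would not need a control
flush, while one tainted with ASI_TAINT_USER_CONTROL would.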

+};
+
+/*
+ * An ASI domain (struct asi) represents a restricted address space. The

No need for "(struct asi)" - it is right below :).

+ * unrestricted address space (and user address space under PTI) are not
+ * represented as a domain.
+ */
+struct asi {
+	pgd_t *pgd;
+	struct mm_struct *mm;
+	int64_t ref_count;
+	enum asi_class_id class_id;
+};
+
+DECLARE_PER_CPU_ALIGNED(struct asi *, curr_asi);

Or simply "asi" - this per-CPU var will be so prominent that when you do
"per_cpu(asi)" you know exactly what it is.



+
+void asi_init_mm_state(struct mm_struct *mm);
+
+int asi_init_class(enum asi_class_id class_id, struct asi_taint_policy *taint_policy);
+void asi_uninit_class(enum asi_class_id class_id);

"uninit", meh. "exit" perhaps? or "destroy"?

And you have "asi_destroy" already so I guess you can do:

asi_class_init()
asi_class_destroy()

to have the namespace correct.

+const char *asi_class_name(enum asi_class_id class_id);
+
+int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_asi);
+void asi_destroy(struct asi *asi);
+
+/* Enter an ASI domain (restricted address space) and begin the critical section. */
+void asi_enter(struct asi *asi);
+
+/*
+ * Leave the "tense" state if we are in it, i.e. end the critical section. We
+ * will stay relaxed until the next asi_enter.
+ */
+void asi_relax(void);
+
+/* Immediately exit the restricted address space if in it */
+void asi_exit(void);
+
+/* The target is the domain we'll enter when returning to process context. */
+static __always_inline struct asi *asi_get_target(struct task_struct *p)
+{
+	return p->thread.asi_state.target;
+}
+
+static __always_inline void asi_set_target(struct task_struct *p,
+					   struct asi *target)
+{
+	p->thread.asi_state.target = target;
+}
+
+static __always_inline struct asi *asi_get_current(void)
+{
+	return this_cpu_read(curr_asi);
+}
+
+/* Are we currently in a restricted address space? */
+static __always_inline bool asi_is_restricted(void)
+{
+	return (bool)asi_get_current();
+}
+
+/* If we exit/have exited, can we stay that way until the next asi_enter? */
+static __always_inline bool asi_is_relaxed(void)
+{
+	return !asi_get_target(current);
+}
+
+/*
+ * Is the current task in the critical section?
+ *
+ * This is just the inverse of !asi_is_relaxed(). We have both functions in order to
+ * help write intuitive client code. In particular, asi_is_tense returns false
+ * when ASI is disabled, which is judged to make user code more obvious.
+ */
+static __always_inline bool asi_is_tense(void)
+{
+	return !asi_is_relaxed();
+}

So can we tone down the silly helpers above? You don't really need
asi_is_tense() for example. It is still very intuitive if I read

	if (!asi_is_relaxed())

...

+
+static __always_inline pgd_t *asi_pgd(struct asi *asi)
+{
+	return asi ? asi->pgd : NULL;
+}
+
+#define INIT_MM_ASI(init_mm) \
+	.asi_init_lock = __MUTEX_INITIALIZER(init_mm.asi_init_lock),
+
+void asi_handle_switch_mm(void);
+
+#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */
+
+#endif

Splitting the review here; I'll continue in the next mail as this patch is
kinda big for one mail.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette