On Fri, Jan 10, 2025 at 06:40:29PM +0000, Brendan Jackman wrote: > Subject: Re: [PATCH RFC v2 03/29] mm: asi: Introduce ASI core API x86/asi: ... > Introduce core API for Address Space Isolation (ASI). Kernel address > space isolation provides the ability to run some kernel > code with a restricted kernel address space. > > There can be multiple classes of such restricted kernel address spaces > (e.g. KPTI, KVM-PTI etc.). Each ASI class is identified by an index. > The ASI class can register some hooks to be called when > entering/exiting the restricted address space. > > Currently, there is a fixed maximum number of ASI classes supported. > In addition, each process can have at most one restricted address space > from each ASI class. Neither of these are inherent limitations and > are merely simplifying assumptions for the time being. > > To keep things simpler for the time being, we disallow context switches Please use passive voice in your commit message: no "we" or "I", etc, and describe your changes in imperative mood. Also, see section "Changelog" in Documentation/process/maintainer-tip.rst > within the restricted address space. In the future, we would be able to > relax this limitation for the case of context switches to different > threads within the same process (or to the idle thread and back). > > Note that this doesn't really support protecting sibling VM guests > within the same VMM process from one another. From first principles > it seems unlikely that anyone who cares about VM isolation would do > that, but there could be a use-case to think about. In that case need > something like the OTHER_MM logic might be needed, but specific to > intra-process guest separation. > > [0]: > https://lore.kernel.org/kvm/1562855138-19507-1-git-send-email-alexandre.chartre@xxxxxxxxxx > > Notes about RFC-quality implementation details: > > - Ignoring checkpatch.pl AVOID_BUG. > - The dynamic registration of classes might be pointless complexity. > This was kept from RFCv1 without much thought. > - The other-mm logic is also perhaps overly complex, suggestions are > welcome for how best to tackle this (or we could just forget about > it for the moment, and rely on asi_exit() happening in process > switch). > - The taint flag definitions would probably be clearer with an enum or > something. > > Checkpatch-args: --ignore=AVOID_BUG,COMMIT_LOG_LONG_LINE,EXPORT_SYMBOL > Co-developed-by: Ofir Weisse <oweisse@xxxxxxxxxx> > Signed-off-by: Ofir Weisse <oweisse@xxxxxxxxxx> > Co-developed-by: Junaid Shahid <junaids@xxxxxxxxxx> > Signed-off-by: Junaid Shahid <junaids@xxxxxxxxxx> > Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx> > --- > arch/x86/include/asm/asi.h | 208 +++++++++++++++++++++++ > arch/x86/include/asm/processor.h | 8 + > arch/x86/mm/Makefile | 1 + > arch/x86/mm/asi.c | 350 +++++++++++++++++++++++++++++++++++++++ > arch/x86/mm/init.c | 3 +- > arch/x86/mm/tlb.c | 1 + > include/asm-generic/asi.h | 67 ++++++++ > include/linux/mm_types.h | 7 + > kernel/fork.c | 3 + > kernel/sched/core.c | 9 + > mm/init-mm.c | 4 + > 11 files changed, 660 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/asi.h b/arch/x86/include/asm/asi.h > new file mode 100644 > index 0000000000000000000000000000000000000000..7cc635b6653a3970ba9dbfdc9c828a470e27bd44 > --- /dev/null > +++ b/arch/x86/include/asm/asi.h > @@ -0,0 +1,208 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _ASM_X86_ASI_H > +#define _ASM_X86_ASI_H > + > +#include <linux/sched.h> > + > +#include <asm-generic/asi.h> > + > +#include <asm/pgtable_types.h> > +#include <asm/percpu.h> > +#include <asm/processor.h> > + > +#ifdef CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION > + > +/* > + * Overview of API usage by ASI clients: > + * > + * Setup: First call asi_init() to create a domain. At present only one domain > + * can be created per mm per class, but it's safe to asi_init() this domain > + * multiple times. For each asi_init() call you must call asi_destroy() AFTER > + * you are certain all CPUs have exited the restricted address space (by > + * calling asi_exit()). > + * > + * Runtime usage: > + * > + * 1. Call asi_enter() to switch to the restricted address space. This can't be > + * from an interrupt or exception handler and preemption must be disabled. > + * > + * 2. Execute untrusted code. > + * > + * 3. Call asi_relax() to inform the ASI subsystem that untrusted code execution > + * is finished. This doesn't cause any address space change. This can't be > + * from an interrupt or exception handler and preemption must be disabled. > + * > + * 4. Either: > + * > + * a. Go back to 1. > + * > + * b. Call asi_exit() before returning to userspace. This immediately > + * switches to the unrestricted address space. So only from reading this, it does sound weird. Maybe the code does it differently - I'll see soon - but this basically says: I asi_enter(), do something, asi_relax() and then I decide to do something more and to asi_enter() again!? And then I can end it all with a *single* asi_exit() call? Hm, definitely weird API. Why? /* * Leave the "tense" state if we are in it, i.e. end the critical section. We * will stay relaxed until the next asi_enter. */ void asi_relax(void); Yeah, so there's no API functions balance between enter() and relax()... > + * > + * The region between 1 and 3 is called the "ASI critical section". During the > + * critical section, it is a bug to access any sensitive data, and you mustn't > + * sleep. > + * > + * The restriction on sleeping is not really a fundamental property of ASI. > + * However for performance reasons it's important that the critical section is > + * absolutely as short as possible. So the ability to do sleepy things like > + * taking mutexes oughtn't to confer any convenience on API users. > + * > + * Similarly to the issue of sleeping, the need to asi_exit in case 4b is not a > + * fundamental property of the system but a limitation of the current > + * implementation. With further work it is possible to context switch > + * from and/or to the restricted address space, and to return to userspace > + * directly from the restricted address space, or _in_ it. > + * > + * Note that the critical section only refers to the direct execution path from > + * asi_enter to asi_relax: it's fine to access sensitive data from exceptions > + * and interrupt handlers that occur during that time. ASI will re-enter the > + * restricted address space before returning from the outermost > + * exception/interrupt. > + * > + * Note: ASI does not modify KPTI behaviour; when ASI and KPTI run together > + * there are 2+N address spaces per task: the unrestricted kernel address space, > + * the user address space, and one restricted (kernel) address space for each of > + * the N ASI classes. > + */ > + > +/* > + * ASI uses a per-CPU tainting model to track what mitigation actions are > + * required on domain transitions. Taints exist along two dimensions: > + * > + * - Who touched the CPU (guest, unprotected kernel, userspace). > + * > + * - What kind of state might remain: "data" means there might be data owned by > + * a victim domain left behind in a sidechannel. "Control" means there might > + * be state controlled by an attacker domain left behind in the branch > + * predictor. > + * > + * In principle the same domain can be both attacker and victim, thus we have > + * both data and control taints for userspace, although there's no point in > + * trying to protect against attacks from the kernel itself, so there's no > + * ASI_TAINT_KERNEL_CONTROL. > + */ > +#define ASI_TAINT_KERNEL_DATA ((asi_taints_t)BIT(0)) > +#define ASI_TAINT_USER_DATA ((asi_taints_t)BIT(1)) > +#define ASI_TAINT_GUEST_DATA ((asi_taints_t)BIT(2)) > +#define ASI_TAINT_OTHER_MM_DATA ((asi_taints_t)BIT(3)) > +#define ASI_TAINT_USER_CONTROL ((asi_taints_t)BIT(4)) > +#define ASI_TAINT_GUEST_CONTROL ((asi_taints_t)BIT(5)) > +#define ASI_TAINT_OTHER_MM_CONTROL ((asi_taints_t)BIT(6)) > +#define ASI_NUM_TAINTS 6 > +static_assert(BITS_PER_BYTE * sizeof(asi_taints_t) >= ASI_NUM_TAINTS); Why is this a typedef at all to make the code more unreadable than it needs to be? Why not a simple unsigned int or char or whatever you need? > + > +#define ASI_TAINTS_CONTROL_MASK \ > + (ASI_TAINT_USER_CONTROL | ASI_TAINT_GUEST_CONTROL | ASI_TAINT_OTHER_MM_CONTROL) > + > +#define ASI_TAINTS_DATA_MASK \ > + (ASI_TAINT_KERNEL_DATA | ASI_TAINT_USER_DATA | ASI_TAINT_OTHER_MM_DATA) > + > +#define ASI_TAINTS_GUEST_MASK (ASI_TAINT_GUEST_DATA | ASI_TAINT_GUEST_CONTROL) > +#define ASI_TAINTS_USER_MASK (ASI_TAINT_USER_DATA | ASI_TAINT_USER_CONTROL) > + > +/* The taint policy tells ASI how a class interacts with the CPU taints */ > +struct asi_taint_policy { > + /* > + * What taints would necessitate a flush when entering the domain, to > + * protect it from attack by prior domains? > + */ > + asi_taints_t prevent_control; So if those necessitate a flush, why isn't this var called "taints_to_flush" or whatever which exactly explains what it is? > + /* > + * What taints would necessetate a flush when entering the domain, to + * What taints would necessetate a flush when entering the domain, to Unknown word [necessetate] in comment. Suggestions: ['necessitate', Spellchecker please. Go over your whole set. > + * protect former domains from attack by this domain? > + */ > + asi_taints_t protect_data; Same. > + /* What taints should be set when entering the domain? */ > + asi_taints_t set; So "required_taints" or so... hm? > +}; > + > +/* > + * An ASI domain (struct asi) represents a restricted address space. The no need for "(struct asi)" - it is right below :). > + * unrestricted address space (and user address space under PTI) are not > + * represented as a domain. > + */ > +struct asi { > + pgd_t *pgd; > + struct mm_struct *mm; > + int64_t ref_count; > + enum asi_class_id class_id; > +}; > + > +DECLARE_PER_CPU_ALIGNED(struct asi *, curr_asi); Or simply "asi" - this per-CPU var will be so prominent so that when you do "per_cpu(asi)" you know what exactly it is > + > +void asi_init_mm_state(struct mm_struct *mm); > + > +int asi_init_class(enum asi_class_id class_id, struct asi_taint_policy *taint_policy); > +void asi_uninit_class(enum asi_class_id class_id); "uninit", meh. "exit" perhaps? or "destroy"? And you have "asi_destroy" already so I guess you can do: asi_class_init() asi_class_destroy() to have the namespace correct. > +const char *asi_class_name(enum asi_class_id class_id); > + > +int asi_init(struct mm_struct *mm, enum asi_class_id class_id, struct asi **out_asi); > +void asi_destroy(struct asi *asi); > + > +/* Enter an ASI domain (restricted address space) and begin the critical section. */ > +void asi_enter(struct asi *asi); > + > +/* > + * Leave the "tense" state if we are in it, i.e. end the critical section. We > + * will stay relaxed until the next asi_enter. > + */ > +void asi_relax(void); > + > +/* Immediately exit the restricted address space if in it */ > +void asi_exit(void); > + > +/* The target is the domain we'll enter when returning to process context. */ > +static __always_inline struct asi *asi_get_target(struct task_struct *p) > +{ > + return p->thread.asi_state.target; > +} > + > +static __always_inline void asi_set_target(struct task_struct *p, > + struct asi *target) > +{ > + p->thread.asi_state.target = target; > +} > + > +static __always_inline struct asi *asi_get_current(void) > +{ > + return this_cpu_read(curr_asi); > +} > + > +/* Are we currently in a restricted address space? */ > +static __always_inline bool asi_is_restricted(void) > +{ > + return (bool)asi_get_current(); > +} > + > +/* If we exit/have exited, can we stay that way until the next asi_enter? */ > +static __always_inline bool asi_is_relaxed(void) > +{ > + return !asi_get_target(current); > +} > + > +/* > + * Is the current task in the critical section? > + * > + * This is just the inverse of !asi_is_relaxed(). We have both functions in order to > + * help write intuitive client code. In particular, asi_is_tense returns false > + * when ASI is disabled, which is judged to make user code more obvious. > + */ > +static __always_inline bool asi_is_tense(void) > +{ > + return !asi_is_relaxed(); > +} So can we tone down the silly helpers above? You don't really need asi_is_tense() for example. It is still very intuitive if I read if (!asi_is_relaxed()) ... > + > +static __always_inline pgd_t *asi_pgd(struct asi *asi) > +{ > + return asi ? asi->pgd : NULL; > +} > + > +#define INIT_MM_ASI(init_mm) \ > + .asi_init_lock = __MUTEX_INITIALIZER(init_mm.asi_init_lock), > + > +void asi_handle_switch_mm(void); > + > +#endif /* CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION */ > + > +#endif Splitting the patch here and will continue with the next one as this one is kinda big for one mail. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette