On Mon, Nov 02, 2020 at 12:53:15PM -0800, ira.weiny@xxxxxxxxx wrote: > From: Fenghua Yu <fenghua.yu@xxxxxxxxx> > > PKS allows kernel users to define domains of page mappings which have > additional protections beyond the paging protections. > > Add an API to allocate, use, and free a protection key which identifies > such a domain. Export 5 new symbols pks_key_alloc(), pks_mknoaccess(), > pks_mkread(), pks_mkrdwr(), and pks_key_free(). Add 2 new macros; > PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey). > > Update the protection key documentation to cover pkeys on supervisor > pages. > > Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx> > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx> > Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx> > > --- > Changes from V1 > Per Dave Hansen > Add flags to pks_key_alloc() to help future proof the > interface if/when the key space is exhausted. > > Changes from RFC V3 > Per Dave Hansen > Put WARN_ON_ONCE in pks_key_free() > s/pks_mknoaccess/pks_mk_noaccess/ > s/pks_mkread/pks_mk_readonly/ > s/pks_mkrdwr/pks_mk_readwrite/ > Change return pks_key_alloc() to EOPNOTSUPP when not > supported or configured > Per Peter Zijlstra > Remove unneeded preempt disable/enable > --- > Documentation/core-api/protection-keys.rst | 102 ++++++++++++++--- > arch/x86/include/asm/pgtable_types.h | 12 ++ > arch/x86/include/asm/pkeys.h | 11 ++ > arch/x86/include/asm/pkeys_common.h | 4 + > arch/x86/mm/pkeys.c | 126 +++++++++++++++++++++ > include/linux/pgtable.h | 4 + > include/linux/pkeys.h | 24 ++++ > 7 files changed, 265 insertions(+), 18 deletions(-) > > diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst > index ec575e72d0b2..c4e6c480562f 100644 > --- a/Documentation/core-api/protection-keys.rst > +++ b/Documentation/core-api/protection-keys.rst > @@ -4,25 +4,33 @@ > Memory Protection Keys > ====================== > > -Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature > -which is found on Intel's Skylake (and later) "Scalable Processor" > -Server CPUs. It will be available in future non-server Intel parts > -and future AMD processors. > - > -For anyone wishing to test or use this feature, it is available in > -Amazon's EC2 C5 instances and is known to work there using an Ubuntu > -17.04 image. > - > Memory Protection Keys provides a mechanism for enforcing page-based > protections, but without requiring modification of the page tables > -when an application changes protection domains. It works by > -dedicating 4 previously ignored bits in each page table entry to a > -"protection key", giving 16 possible keys. > +when an application changes protection domains. > + > +PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable > +Processor" Server CPUs and later. And It will be available in future > +non-server Intel parts and future AMD processors. > + > +Future Intel processors will support Protection Keys for Supervisor pages > +(PKS). > + > +For anyone wishing to test or use user space pkeys, it is available in Amazon's > +EC2 C5 instances and is known to work there using an Ubuntu 17.04 image. > + > +pkeys work by dedicating 4 previously Reserved bits in each page table entry to > +a "protection key", giving 16 possible keys. User and Supervisor pages are > +treated separately. > + > +Protections for each page are controlled with per CPU registers for each type > +of page User and Supervisor. Each of these 32 bit register stores two separate > +bits (Access Disable and Write Disable) for each key. > > -There is also a new user-accessible register (PKRU) with two separate > -bits (Access Disable and Write Disable) for each key. Being a CPU > -register, PKRU is inherently thread-local, potentially giving each > -thread a different set of protections from every other thread. > +For Userspace the register is user-accessible (rdpkru/wrpkru). For > +Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel. > + > +Being a CPU register, pkeys are inherently thread-local, potentially giving > +each thread an independent set of protections from every other thread. > > There are two new instructions (RDPKRU/WRPKRU) for reading and writing > to the new register. The feature is only available in 64-bit mode, > @@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs. These > permissions are enforced on data access only and have no effect on > instruction fetches. > > -Syscalls > -======== > +For kernel space rdmsr/wrmsr are used to access the kernel MSRs. > + > + > +Syscalls for user space keys > +============================ > > There are 3 system calls which directly interact with pkeys:: > > @@ -98,3 +109,58 @@ with a read():: > The kernel will send a SIGSEGV in both cases, but si_code will be set > to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when > the plain mprotect() permissions are violated. > + > + > +Kernel API for PKS support > +========================== > + > +The following interface is used to allocate, use, and free a pkey which defines > +a 'protection domain' within the kernel. Setting a pkey value in a supervisor > +mapping adds that mapping to the protection domain. > + > + int pks_key_alloc(const char * const pkey_user, int flags); > + #define PAGE_KERNEL_PKEY(pkey) > + #define _PAGE_KEY(pkey) > + void pks_mk_noaccess(int pkey); > + void pks_mk_readonly(int pkey); > + void pks_mk_readwrite(int pkey); > + void pks_key_free(int pkey); > + > +pks_key_alloc() allocates keys dynamically to allow better use of the limited > +key space. 'flags' alter the allocation based on the users need. Currently > +they can request an exclusive key. > + > +Callers of pks_key_alloc() _must_ be prepared for it to fail and take > +appropriate action. This is due mainly to the fact that PKS may not be > +available on all arch's. Failure to check the return of pks_key_alloc() and > +using any of the rest of the API is undefined. > + > +Kernel users must set the PTE permissions in the page table entries for the > +mappings they want to protect. This can be done with PAGE_KERNEL_PKEY() or > +_PAGE_KEY(). > + > +The pks_mk*() family of calls allows kernel users the ability to change the > +protections for the domain identified by the pkey specified. 3 states are > +available pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which > +set the access to none, read, and read/write respectively. > + > +Finally, pks_key_free() allows a user to return the key to the allocator for > +use by others. > + > +The interface maintains pks_mk_noaccess() (Access Disabled (AD=1)) for all keys > +not currently allocated. Therefore, the user can depend on access being > +disabled when pks_key_alloc() returns a key and the user should remove mappings > +from the domain (remove the pkey from the PTE) prior to calling pks_key_free(). > + > +It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing > +but still maintains ordering properties similar to WRPKRU. Thus it is safe to > +immediately use a mapping when the pks_mk*() functions returns. > + > +The current SDM section on PKRS needs updating but should be the same as that > +of WRPKRU. So to quote from the WRPKRU text: > + > + WRPKRU will never execute transiently. Memory accesses > + affected by PKRU register will not execute (even transiently) > + until all prior executions of WRPKRU have completed execution > + and updated the PKRU register. > + > diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h > index 816b31c68550..c9fdfbdcbbfb 100644 > --- a/arch/x86/include/asm/pgtable_types.h > +++ b/arch/x86/include/asm/pgtable_types.h > @@ -73,6 +73,12 @@ > _PAGE_PKEY_BIT2 | \ > _PAGE_PKEY_BIT3) > > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS > +#define _PAGE_PKEY(pkey) (_AT(pteval_t, pkey) << _PAGE_BIT_PKEY_BIT0) > +#else > +#define _PAGE_PKEY(pkey) (_AT(pteval_t, 0)) > +#endif > + > #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) > #define _PAGE_KNL_ERRATUM_MASK (_PAGE_DIRTY | _PAGE_ACCESSED) > #else > @@ -229,6 +235,12 @@ enum page_cache_mode { > #define PAGE_KERNEL_IO __pgprot_mask(__PAGE_KERNEL_IO) > #define PAGE_KERNEL_IO_NOCACHE __pgprot_mask(__PAGE_KERNEL_IO_NOCACHE) > > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS > +#define PAGE_KERNEL_PKEY(pkey) __pgprot_mask(__PAGE_KERNEL | _PAGE_PKEY(pkey)) > +#else > +#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL > +#endif > + > #endif /* __ASSEMBLY__ */ > > /* xwr */ > diff --git a/arch/x86/include/asm/pkeys.h b/arch/x86/include/asm/pkeys.h > index 4526245b03e5..f84351b4ac7c 100644 > --- a/arch/x86/include/asm/pkeys.h > +++ b/arch/x86/include/asm/pkeys.h > @@ -3,6 +3,7 @@ > #define _ASM_X86_PKEYS_H > > #include <asm/pkeys_common.h> > +#include <asm-generic/mman-common.h> > > #define ARCH_DEFAULT_PKEY 0 > > @@ -138,4 +139,14 @@ static inline int vma_pkey(struct vm_area_struct *vma) > > u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags); > > +#ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS > +int pks_key_alloc(const char *const pkey_user, int flags); > +void pks_key_free(int pkey); > + > +void pks_mk_noaccess(int pkey); > +void pks_mk_readonly(int pkey); > +void pks_mk_readwrite(int pkey); > + > +#endif /* CONFIG_ARCH_HAS_SUPERVISOR_PKEYS */ > + > #endif /*_ASM_X86_PKEYS_H */ > diff --git a/arch/x86/include/asm/pkeys_common.h b/arch/x86/include/asm/pkeys_common.h > index 801a75615209..cd492c23b28c 100644 > --- a/arch/x86/include/asm/pkeys_common.h > +++ b/arch/x86/include/asm/pkeys_common.h > @@ -26,6 +26,10 @@ > PKR_AD_KEY(10) | PKR_AD_KEY(11) | PKR_AD_KEY(12) | \ > PKR_AD_KEY(13) | PKR_AD_KEY(14) | PKR_AD_KEY(15)) > > +/* PKS supports 16 keys. Key 0 is reserved for the kernel. */ > +#define PKS_KERN_DEFAULT_KEY 0 > +#define PKS_NUM_KEYS 16 > + > #ifdef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS > void write_pkrs(u32 new_pkrs); > #else > diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c > index 76a62419c446..0dc77409957a 100644 > --- a/arch/x86/mm/pkeys.c > +++ b/arch/x86/mm/pkeys.c > @@ -3,6 +3,9 @@ > * Intel Memory Protection Keys management > * Copyright (c) 2015, Intel Corporation. > */ > +#undef pr_fmt > +#define pr_fmt(fmt) "x86/pkeys: " fmt > + > #include <linux/debugfs.h> /* debugfs_create_u32() */ > #include <linux/mm_types.h> /* mm_struct, vma, etc... */ > #include <linux/pkeys.h> /* PKEY_* */ > @@ -231,6 +234,7 @@ u32 update_pkey_val(u32 pk_reg, int pkey, unsigned int flags) > > return pk_reg; > } > +EXPORT_SYMBOL_GPL(update_pkey_val); > > DEFINE_PER_CPU(u32, pkrs_cache); > > @@ -262,3 +266,125 @@ void write_pkrs(u32 new_pkrs) > } > put_cpu_ptr(pkrs); > } > +EXPORT_SYMBOL_GPL(write_pkrs); > + > +/** > + * Do not call this directly, see pks_mk*() below. > + * > + * @pkey: Key for the domain to change > + * @protection: protection bits to be used > + * > + * Protection utilizes the same protection bits specified for User pkeys > + * PKEY_DISABLE_ACCESS > + * PKEY_DISABLE_WRITE > + * > + */ > +static inline void pks_update_protection(int pkey, unsigned long protection) > +{ > + current->thread.saved_pkrs = update_pkey_val(current->thread.saved_pkrs, > + pkey, protection); > + write_pkrs(current->thread.saved_pkrs); > +} > + > +/** > + * PKS access control functions > + * > + * Change the access of the domain specified by the pkey. These are global > + * updates. They only affects the current running thread. It is undefined and > + * a bug for users to call this without having allocated a pkey and using it as > + * pkey here. > + * > + * pks_mk_noaccess() > + * Disable all access to the domain > + * pks_mk_readonly() > + * Make the domain Read only > + * pks_mk_readwrite() > + * Make the domain Read/Write > + * > + * @pkey the pkey for which the access should change. > + * > + */ > +void pks_mk_noaccess(int pkey) > +{ > + pks_update_protection(pkey, PKEY_DISABLE_ACCESS); > +} > +EXPORT_SYMBOL_GPL(pks_mk_noaccess); > + > +void pks_mk_readonly(int pkey) > +{ > + pks_update_protection(pkey, PKEY_DISABLE_WRITE); > +} > +EXPORT_SYMBOL_GPL(pks_mk_readonly); > + > +void pks_mk_readwrite(int pkey) > +{ > + pks_update_protection(pkey, 0); > +} > +EXPORT_SYMBOL_GPL(pks_mk_readwrite); > + > +static const char pks_key_user0[] = "kernel"; > + > +/* Store names of allocated keys for debug. Key 0 is reserved for the kernel. */ > +static const char *pks_key_users[PKS_NUM_KEYS] = { > + pks_key_user0 > +}; > + > +/* > + * Each key is represented by a bit. Bit 0 is set for key 0 and reserved for > + * its use. We use ulong for the bit operations but only 16 bits are used. > + */ > +static unsigned long pks_key_allocation_map = 1 << PKS_KERN_DEFAULT_KEY; > + > +/* > + * pks_key_alloc - Allocate a PKS key > + * @pkey_user: String stored for debugging of key exhaustion. The caller is > + * responsible to maintain this memory until pks_key_free(). > + * @flags: Flags to modify behavior: see pks_alloc_flags > + * > + * Returns: pkey if success > + * -EOPNOTSUPP if pks is not supported or not enabled > + * -ENOSPC if no keys are available (even for sharing) > + */ > +int pks_key_alloc(const char * const pkey_user, int flags) > +{ > + int nr; > + > + if (!cpu_feature_enabled(X86_FEATURE_PKS)) > + return -EOPNOTSUPP; > + > + while (1) { > + nr = find_first_zero_bit(&pks_key_allocation_map, PKS_NUM_KEYS); > + if (nr >= PKS_NUM_KEYS) { > + pr_info("Cannot allocate supervisor key for %s.\n", > + pkey_user); > + return -ENOSPC; > + } > + if (!test_and_set_bit_lock(nr, &pks_key_allocation_map)) > + break; > + } > + > + /* for debugging key exhaustion */ > + pks_key_users[nr] = pkey_user; > + > + return nr; > +} > +EXPORT_SYMBOL_GPL(pks_key_alloc); > + > +/* > + * pks_key_free - Free a previously allocate PKS key > + * @pkey: Key to be free'ed > + */ > +void pks_key_free(int pkey) > +{ > + if (!cpu_feature_enabled(X86_FEATURE_PKS)) > + return; > + > + if (WARN_ON_ONCE(pkey >= PKS_NUM_KEYS || pkey <= PKS_KERN_DEFAULT_KEY)) > + return; > + > + /* Restore to default of no access */ > + pks_mk_noaccess(pkey); > + pks_key_users[pkey] = NULL; > + __clear_bit(pkey, &pks_key_allocation_map); > +} > +EXPORT_SYMBOL_GPL(pks_key_free); > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > index 38c33eabea89..cd72d73e8e1c 100644 > --- a/include/linux/pgtable.h > +++ b/include/linux/pgtable.h > @@ -1402,6 +1402,10 @@ static inline bool arch_has_pfn_modify_check(void) > # define PAGE_KERNEL_EXEC PAGE_KERNEL > #endif > > +#ifndef PAGE_KERNEL_PKEY > +#define PAGE_KERNEL_PKEY(pkey) PAGE_KERNEL > +#endif > + > /* > * Page Table Modification bits for pgtbl_mod_mask. > * > diff --git a/include/linux/pkeys.h b/include/linux/pkeys.h > index 2955ba976048..0959a4c0ca64 100644 > --- a/include/linux/pkeys.h > +++ b/include/linux/pkeys.h > @@ -50,4 +50,28 @@ static inline void copy_init_pkru_to_fpregs(void) > > #endif /* ! CONFIG_ARCH_HAS_PKEYS */ > > +#define PKS_FLAG_EXCLUSIVE 0x00 > + > +#ifndef CONFIG_ARCH_HAS_SUPERVISOR_PKEYS > +static inline int pks_key_alloc(const char * const pkey_user, int flags) > +{ > + return -EOPNOTSUPP; > +} > +static inline void pks_key_free(int pkey) > +{ > +} > +static inline void pks_mk_noaccess(int pkey) > +{ > + WARN_ON_ONCE(1); So for panic-on-warn systems, this is ok to reboot the box? Are you sure, that feels odd... greg k-h