On 11/6/20 3:29 PM, ira.weiny@xxxxxxxxx wrote: > From: Fenghua Yu <fenghua.yu@xxxxxxxxx> > > PKS allows kernel users to define domains of page mappings which have > additional protections beyond the paging protections. > > Add an API to allocate, use, and free a protection key which identifies > such a domain. Export 5 new symbols pks_key_alloc(), pks_mknoaccess(), > pks_mkread(), pks_mkrdwr(), and pks_key_free(). Add 2 new macros; > PAGE_KERNEL_PKEY(key) and _PAGE_PKEY(pkey). > > Update the protection key documentation to cover pkeys on supervisor > pages. > > Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx> > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx> > Signed-off-by: Fenghua Yu <fenghua.yu@xxxxxxxxx> > > --- > --- > Documentation/core-api/protection-keys.rst | 102 +++++++++++++--- > arch/x86/include/asm/pgtable_types.h | 12 ++ > arch/x86/include/asm/pkeys.h | 11 ++ > arch/x86/include/asm/pkeys_common.h | 4 + > arch/x86/mm/pkeys.c | 128 +++++++++++++++++++++ > include/linux/pgtable.h | 4 + > include/linux/pkeys.h | 24 ++++ > 7 files changed, 267 insertions(+), 18 deletions(-) > > diff --git a/Documentation/core-api/protection-keys.rst b/Documentation/core-api/protection-keys.rst > index ec575e72d0b2..c4e6c480562f 100644 > --- a/Documentation/core-api/protection-keys.rst > +++ b/Documentation/core-api/protection-keys.rst > @@ -4,25 +4,33 @@ > Memory Protection Keys > ====================== > > -Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature > -which is found on Intel's Skylake (and later) "Scalable Processor" > -Server CPUs. It will be available in future non-server Intel parts > -and future AMD processors. > - > -For anyone wishing to test or use this feature, it is available in > -Amazon's EC2 C5 instances and is known to work there using an Ubuntu > -17.04 image. > - > Memory Protection Keys provides a mechanism for enforcing page-based provide > protections, but without requiring modification of the page tables > -when an application changes protection domains. It works by > -dedicating 4 previously ignored bits in each page table entry to a > -"protection key", giving 16 possible keys. > +when an application changes protection domains. > + > +PKeys Userspace (PKU) is a feature which is found on Intel's Skylake "Scalable > +Processor" Server CPUs and later. And It will be available in future it > +non-server Intel parts and future AMD processors. > + > +Future Intel processors will support Protection Keys for Supervisor pages > +(PKS). > + > +For anyone wishing to test or use user space pkeys, it is available in Amazon's > +EC2 C5 instances and is known to work there using an Ubuntu 17.04 image. > + > +pkeys work by dedicating 4 previously Reserved bits in each page table entry to > +a "protection key", giving 16 possible keys. User and Supervisor pages are > +treated separately. > + > +Protections for each page are controlled with per CPU registers for each type per-CPU > +of page User and Supervisor. Each of these 32 bit register stores two separate 32-bit registers > +bits (Access Disable and Write Disable) for each key. > > -There is also a new user-accessible register (PKRU) with two separate > -bits (Access Disable and Write Disable) for each key. Being a CPU > -register, PKRU is inherently thread-local, potentially giving each > -thread a different set of protections from every other thread. > +For Userspace the register is user-accessible (rdpkru/wrpkru). For > +Supervisor, the register (MSR_IA32_PKRS) is accessible only to the kernel. > + > +Being a CPU register, pkeys are inherently thread-local, potentially giving > +each thread an independent set of protections from every other thread. > > There are two new instructions (RDPKRU/WRPKRU) for reading and writing > to the new register. The feature is only available in 64-bit mode, > @@ -30,8 +38,11 @@ even though there is theoretically space in the PAE PTEs. These > permissions are enforced on data access only and have no effect on > instruction fetches. > > -Syscalls > -======== > +For kernel space rdmsr/wrmsr are used to access the kernel MSRs. > + > + > +Syscalls for user space keys > +============================ > > There are 3 system calls which directly interact with pkeys:: > > @@ -98,3 +109,58 @@ with a read():: > The kernel will send a SIGSEGV in both cases, but si_code will be set > to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when > the plain mprotect() permissions are violated. > + > + > +Kernel API for PKS support > +========================== > + > +The following interface is used to allocate, use, and free a pkey which defines > +a 'protection domain' within the kernel. Setting a pkey value in a supervisor > +mapping adds that mapping to the protection domain. > + > + int pks_key_alloc(const char * const pkey_user, int flags); > + #define PAGE_KERNEL_PKEY(pkey) > + #define _PAGE_KEY(pkey) > + void pks_mk_noaccess(int pkey); > + void pks_mk_readonly(int pkey); > + void pks_mk_readwrite(int pkey); > + void pks_key_free(int pkey); > + > +pks_key_alloc() allocates keys dynamically to allow better use of the limited > +key space. 'flags' alter the allocation based on the users need. Currently user's or maybe users' > +they can request an exclusive key. > + > +Callers of pks_key_alloc() _must_ be prepared for it to fail and take > +appropriate action. This is due mainly to the fact that PKS may not be > +available on all arch's. Failure to check the return of pks_key_alloc() and > +using any of the rest of the API is undefined. > + > +Kernel users must set the PTE permissions in the page table entries for the > +mappings they want to protect. This can be done with PAGE_KERNEL_PKEY() or > +_PAGE_KEY(). > + > +The pks_mk*() family of calls allows kernel users the ability to change the > +protections for the domain identified by the pkey specified. 3 states are > +available pks_mk_noaccess(), pks_mk_readonly(), and pks_mk_readwrite() which available: > +set the access to none, read, and read/write respectively. > + > +Finally, pks_key_free() allows a user to return the key to the allocator for > +use by others. > + > +The interface maintains pks_mk_noaccess() (Access Disabled (AD=1)) for all keys > +not currently allocated. Therefore, the user can depend on access being > +disabled when pks_key_alloc() returns a key and the user should remove mappings > +from the domain (remove the pkey from the PTE) prior to calling pks_key_free(). > + > +It should be noted that the underlying WRMSR(MSR_IA32_PKRS) is not serializing > +but still maintains ordering properties similar to WRPKRU. Thus it is safe to > +immediately use a mapping when the pks_mk*() functions returns. return. > + > +The current SDM section on PKRS needs updating but should be the same as that > +of WRPKRU. So to quote from the WRPKRU text: > + > + WRPKRU will never execute transiently. Memory accesses > + affected by PKRU register will not execute (even transiently) > + until all prior executions of WRPKRU have completed execution > + and updated the PKRU register. > + -- ~Randy