On Tue, Mar 08, 2022 at 09:49:01AM -0800, Reinette Chatre wrote: > Hi Jarkko, > > On 3/8/2022 9:00 AM, Jarkko Sakkinen wrote: > > On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote: > >> Hi Jarkko, > >> > >> On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote: > >>> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote: > >>>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote: > >>>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote: > >>>>>> Hi Jarkko, > >>>>>> > >>>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote: > >>>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote: > >>>>>>>> === Summary === > >>>>>>>> > >>>>>>>> An SGX VMA can only be created if its permissions are the same or > >>>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA > >>>>>>>> creation this same rule is again enforced by the page fault handler: > >>>>>>>> faulted enclave pages are required to have equal or more relaxed > >>>>>>>> EPCM permissions than the VMA permissions. > >>>>>>>> > >>>>>>>> On SGX1 systems the additional enforcement in the page fault handler > >>>>>>>> is redundant and on SGX2 systems it incorrectly prevents access. > >>>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the > >>>>>>>> permission rule. The rule used during original VMA creation will > >>>>>>>> ensure that any access attempt will use correct permissions. > >>>>>>>> With SGX2 the EPCM permissions of a page can change after VMA > >>>>>>>> creation resulting in the VMA permissions potentially being more > >>>>>>>> relaxed than the EPCM permissions and the page fault handler > >>>>>>>> incorrectly blocking valid access attempts. > >>>>>>>> > >>>>>>>> Enable the VMA's pages to remain accessible while ensuring that > >>>>>>>> the PTEs are installed to match the EPCM permissions but not be > >>>>>>>> more relaxed than the VMA permissions. > >>>>>>>> > >>>>>>>> === Full Changelog === > >>>>>>>> > >>>>>>>> An SGX enclave is an area of memory where parts of an application > >>>>>>>> can reside. First an enclave is created and loaded (from > >>>>>>>> non-enclave memory) with the code and data of an application, > >>>>>>>> then user space can map (mmap()) the enclave memory to > >>>>>>>> be able to enter the enclave at its defined entry points for > >>>>>>>> execution within it. > >>>>>>>> > >>>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map > >>>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is > >>>>>>>> its tracking of the enclave page permissions. When a page is loaded > >>>>>>>> into the enclave its permissions are specified and recorded in the > >>>>>>>> EPCM. In parallel the kernel maintains permissions within the > >>>>>>>> page table entries (PTEs) and the rule is that PTE permissions > >>>>>>>> are not allowed to be more relaxed than the EPCM permissions. > >>>>>>>> > >>>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the > >>>>>>>> mapping has the same or weaker permissions than the permissions that > >>>>>>>> were vetted during enclave creation. This is enforced by > >>>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect() > >>>>>>>> paths. This rule remains. > >>>>>>>> > >>>>>>>> One feature of SGX2 is to support the modification of EPCM permissions > >>>>>>>> after enclave initialization. Enclave pages may thus already be part > >>>>>>>> of a VMA at the time their EPCM permissions are changed resulting > >>>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM > >>>>>>>> permissions. > >>>>>>>> > >>>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM > >>>>>>>> permissions in preparation for dynamic EPCM permission changes > >>>>>>>> made possible in SGX2. New VMAs that attempt to have more relaxed > >>>>>>>> permissions than EPCM permissions continue to be unsupported. > >>>>>>>> > >>>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed > >>>>>>>> than EPCM permissions instead of dynamically changing VMA permissions > >>>>>>>> when EPCM permissions change are: > >>>>>>>> 1) Changing VMA permissions involve splitting VMAs which is an > >>>>>>>> operation that can fail. Additionally changing EPCM permissions of > >>>>>>>> a range of pages could also fail on any of the pages involved. > >>>>>>>> Handling these error cases causes problems. For example, if an > >>>>>>>> EPCM permission change fails and the VMA has already been split > >>>>>>>> then it is not possible to undo the VMA split nor possible to > >>>>>>>> undo the EPCM permission changes that did succeed before the > >>>>>>>> failure. > >>>>>>>> 2) The kernel has little insight into the user space where EPCM > >>>>>>>> permissions are controlled from. For example, a RW page may > >>>>>>>> be made RO just before it is made RX and splitting the VMAs > >>>>>>>> while the VMAs may change soon is unnecessary. > >>>>>>>> > >>>>>>>> Remove the extra permission check called on a page fault > >>>>>>>> (vm_operations_struct->fault) or during debugging > >>>>>>>> (vm_operations_struct->access) when loading the enclave page from swap > >>>>>>>> that ensures that the VMA permissions are not more relaxed than the > >>>>>>>> EPCM permissions. Since a VMA could only exist if it passed the > >>>>>>>> original permission checks during mmap() and a VMA may indeed > >>>>>>>> have more relaxed permissions than the EPCM permissions this extra > >>>>>>>> permission check is no longer appropriate. > >>>>>>>> > >>>>>>>> With the permission check removed, ensure that PTEs do > >>>>>>>> not blindly inherit the VMA permissions but instead the permissions > >>>>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA > >>>>>>>> and enclave perspective) are installed with the writable bit set, > >>>>>>>> reducing the need for this additional flow to the permission mismatch > >>>>>>>> cases handled next. > >>>>>>>> > >>>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx> > >>>>>>>> --- > >>>>>>>> Changes since V1: > >>>>>>>> - Reword commit message (Jarkko). > >>>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave). > >>>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the > >>>>>>>> relationship between VMA, EPCM, and PTE permissions on SGX > >>>>>>>> systems (Andy). > >>>>>>>> > >>>>>>>> Documentation/x86/sgx.rst | 10 +++++++++ > >>>>>>>> arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++---------------- > >>>>>>>> 2 files changed, 30 insertions(+), 18 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst > >>>>>>>> index 89ff924b1480..5659932728a5 100644 > >>>>>>>> --- a/Documentation/x86/sgx.rst > >>>>>>>> +++ b/Documentation/x86/sgx.rst > >>>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are: > >>>>>>>> * PTEs are installed to match the EPCM permissions, but not be more > >>>>>>>> relaxed than the VMA permissions. > >>>>>>>> > >>>>>>>> +On systems supporting SGX2 EPCM permissions may change while the > >>>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions. > >>>>>>>> +This means that a running VMA may appear to allow access to an enclave > >>>>>>>> +page that is not allowed by its EPCM permissions. For example, when an > >>>>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is > >>>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel > >>>>>>>> +continues to maintain correct access to the enclave page through the > >>>>>>>> +PTE that will ensure that only access allowed by both the VMA > >>>>>>>> +and EPCM permissions are permitted. > >>>>>>>> + > >>>>>>>> Application interface > >>>>>>>> ===================== > >>>>>>>> > >>>>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > >>>>>>>> index 48afe96ae0f0..b6105d9e7c46 100644 > >>>>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c > >>>>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c > >>>>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page, > >>>>>>>> } > >>>>>>>> > >>>>>>>> static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, > >>>>>>>> - unsigned long addr, > >>>>>>>> - unsigned long vm_flags) > >>>>>>>> + unsigned long addr) > >>>>>>>> { > >>>>>>>> - unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC); > >>>>>>>> struct sgx_epc_page *epc_page; > >>>>>>>> struct sgx_encl_page *entry; > >>>>>>>> > >>>>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl, > >>>>>>>> if (!entry) > >>>>>>>> return ERR_PTR(-EFAULT); > >>>>>>>> > >>>>>>>> - /* > >>>>>>>> - * Verify that the faulted page has equal or higher build time > >>>>>>>> - * permissions than the VMA permissions (i.e. the subset of {VM_READ, > >>>>>>>> - * VM_WRITE, VM_EXECUTE} in vma->vm_flags). > >>>>>>>> - */ > >>>>>>>> - if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits) > >>>>>>>> - return ERR_PTR(-EFAULT); > >>>>>>>> - > >>>>>>>> /* Entry successfully located. */ > >>>>>>>> if (entry->epc_page) { > >>>>>>>> if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED) > >>>>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf) > >>>>>>>> { > >>>>>>>> unsigned long addr = (unsigned long)vmf->address; > >>>>>>>> struct vm_area_struct *vma = vmf->vma; > >>>>>>>> + unsigned long page_prot_bits; > >>>>>>>> struct sgx_encl_page *entry; > >>>>>>>> + unsigned long vm_prot_bits; > >>>>>>>> unsigned long phys_addr; > >>>>>>>> struct sgx_encl *encl; > >>>>>>>> vm_fault_t ret; > >>>>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf) > >>>>>>>> > >>>>>>>> mutex_lock(&encl->lock); > >>>>>>>> > >>>>>>>> - entry = sgx_encl_load_page(encl, addr, vma->vm_flags); > >>>>>>>> + entry = sgx_encl_load_page(encl, addr); > >>>>>>>> if (IS_ERR(entry)) { > >>>>>>>> mutex_unlock(&encl->lock); > >>>>>>> > >>>>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf) > >>>>>>>> > >>>>>>>> phys_addr = sgx_get_epc_phys_addr(entry->epc_page); > >>>>>>>> > >>>>>>>> - ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr)); > >>>>>>>> + /* > >>>>>>>> + * Insert PTE to match the EPCM page permissions ensured to not > >>>>>>>> + * exceed the VMA permissions. > >>>>>>>> + */ > >>>>>>>> + vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC); > >>>>>>>> + page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits; > >>>>>>>> + /* > >>>>>>>> + * Add VM_SHARED so that PTE is made writable right away if VMA > >>>>>>>> + * and EPCM are writable (no COW in SGX). > >>>>>>>> + */ > >>>>>>>> + page_prot_bits |= (vma->vm_flags & VM_SHARED); > >>>>>>>> + ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr), > >>>>>>>> + vm_get_page_prot(page_prot_bits)); > >>>>>>>> if (ret != VM_FAULT_NOPAGE) { > >>>>>>>> mutex_unlock(&encl->lock); > >>>>>>>> > >>>>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag > >>>>>>>> * Load an enclave page to EPC if required, and take encl->lock. > >>>>>>>> */ > >>>>>>>> static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl, > >>>>>>>> - unsigned long addr, > >>>>>>>> - unsigned long vm_flags) > >>>>>>>> + unsigned long addr) > >>>>>>>> { > >>>>>>>> struct sgx_encl_page *entry; > >>>>>>>> > >>>>>>>> for ( ; ; ) { > >>>>>>>> mutex_lock(&encl->lock); > >>>>>>>> > >>>>>>>> - entry = sgx_encl_load_page(encl, addr, vm_flags); > >>>>>>>> + entry = sgx_encl_load_page(encl, addr); > >>>>>>>> if (PTR_ERR(entry) != -EBUSY) > >>>>>>>> break; > >>>>>>>> > >>>>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr, > >>>>>>>> return -EFAULT; > >>>>>>>> > >>>>>>>> for (i = 0; i < len; i += cnt) { > >>>>>>>> - entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK, > >>>>>>>> - vma->vm_flags); > >>>>>>>> + entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK); > >>>>>>>> if (IS_ERR(entry)) { > >>>>>>>> ret = PTR_ERR(entry); > >>>>>>>> break; > >>>>>>>> -- > >>>>>>>> 2.25.1 > >>>>>>>> > >>>>>>> > >>>>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created > >>>>>>> pags, you would not need to do this. > >>>>>>> > >>>>>>> These patches could be then safely dropped then: > >>>>>>> > >>>>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions > >>>>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes > >>>>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions > >>>>>>> > >>>>>>> And that would also keep full ABI compatibility without exceptions to the > >>>>>>> existing mainline code. > >>>>>>> > >>>>>> > >>>>>> Dropping these changes do not just impact dynamically created pages. Dropping > >>>>>> these patches would result in EPCM page permission restriction being supported > >>>>>> for all pages, those added before enclave initialization as well as dynamically > >>>>>> added pages, but their PTEs will not be impacted. > >>>>>> > >>>>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and > >>>>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux > >>>>>> would keep allowing and installing RW PTEs to this page. > >>>>> > >>>>> I think that would be perfectly fine, if someone wants to do that. There is > >>>>> no corrateral damage on doing that. Kernel does not get messed because of > >>>>> that. It's a use case that does not make sense in the first place, so it'd > >>>>> be stupid to build anything extensive around it to the kernel. > >>>>> > >>>>> Shooting yourself to the foot is something that kernel does and should not > >>>>> protect user space from unless there is a risk of messing the state of the > >>>>> kernel itself. > >>>>> > >>>>> Much worse is that we have e.g. completely artificial ioctl > >>>>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could e.g. > >>>>> cause extra roundtrips for simple EMODPE. > >>>>> > >>>>> Also this means not having to include 06/32, which keeps 100% backwards > >>>>> compatibility in run-time behaviour to the mainline while not restricting > >>>>> at all dynamically created pages. And we get rid of complex book keeping > >>>>> of vm_run_prot_bits. > >>>>> > >>>>> And generally the whole model is then very easy to understand and explain. > >>>>> If I had to keep presentation of the current mess in the patch set in a > >>>>> conference, I can honestly say that I would be in serious trouble. It's > >>>>> not clean and clear security model, which is a risk by itself. > >>>> > >>>> I.e. > >>>> > >>>> 1. For EADD'd pages: stick what has been the invariant 1,5 years now. Do > >>>> not change it by any means (e.g. 06/32). > >>>> 2. For EAUG'd pages: set vm_max_prot_bits RWX, which essentially means do > >>>> what ever you want with PTE's and EPCM. > >>>> > >>>> It's a clear and understandable model that does nothing bad to the kernel, > >>>> and a run-time developer can surely find away to get things on going. For > >>>> user space, the most important thing is the clarity in kernel behaviour, > >>>> and this does deliver that clarity. It's not perfect but it does do the > >>>> job and anyone can get it. > >>> > >>> Also a quantitive argument for this is that by simplifying security model > >>> this way it is one ioctl less, which must be considered as +1. We do not > >>> want to add new ioctls unless it is something we absolutely cannnot live > >>> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS. > >>> > >> > >> ok, with the implications understood and accepted I will proceed with a new > >> series that separates EPCM from PTEs and make RWX PTEs possible by default > >> for EAUG pages. This has broader impact than just removing > >> the three patches you list. "[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler > >> for present PTEs" is also no longer needed and there is no longer a need > >> to flush PTEs after restricting permissions. New changes also need to > >> be considered - at least the current documentation. I'll rework the series. > > > > Yes, I really think it is a solid plan. Any possible LSM hooks would most > > likely attach to build product, not the dynamic behaviour. > > > > As far as the page fault handler goes, Haitao is correct after the all > > discussions that it makes sense. The purpose of MAP_POPULATE series is > > not to replace it but instead complement it. Just wanted to clear this > > up as I said otherwise earlier this week. > > > > Understood. I will keep the implementation where EAUG is done in page fault > handler. I do plan to pick up your patch "x86/sgx: Export sgx_encl_page_alloc()" > since a consequence of the other changes is that this can now be shared. Yeah, I think we might be able to get this polished for v5.19. I'd expect a revision or few for polishing the corners but other than that this looks to be going on right tracks now. > Reinette BR, Jarkko