Re: [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions

Reinette Chatre <reinette.chatre@xxxxxxxxx> · Tue, 8 Mar 2022 09:49:01 -0800

Hi Jarkko,

On 3/8/2022 9:00 AM, Jarkko Sakkinen wrote:
> On Tue, Mar 08, 2022 at 08:04:33AM -0800, Reinette Chatre wrote:
>> Hi Jarkko,
>>
>> On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
>>> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
>>>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
>>>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
>>>>>> Hi Jarkko,
>>>>>>
>>>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
>>>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
>>>>>>>> === Summary ===
>>>>>>>>
>>>>>>>> An SGX VMA can only be created if its permissions are the same or
>>>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
>>>>>>>> creation this same rule is again enforced by the page fault handler:
>>>>>>>> faulted enclave pages are required to have equal or more relaxed
>>>>>>>> EPCM permissions than the VMA permissions.
>>>>>>>>
>>>>>>>> On SGX1 systems the additional enforcement in the page fault handler
>>>>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
>>>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
>>>>>>>> permission rule. The rule used during original VMA creation will
>>>>>>>> ensure that any access attempt will use correct permissions.
>>>>>>>> With SGX2 the EPCM permissions of a page can change after VMA
>>>>>>>> creation resulting in the VMA permissions potentially being more
>>>>>>>> relaxed than the EPCM permissions and the page fault handler
>>>>>>>> incorrectly blocking valid access attempts.
>>>>>>>>
>>>>>>>> Enable the VMA's pages to remain accessible while ensuring that
>>>>>>>> the PTEs are installed to match the EPCM permissions but not be
>>>>>>>> more relaxed than the VMA permissions.
>>>>>>>>
>>>>>>>> === Full Changelog ===
>>>>>>>>
>>>>>>>> An SGX enclave is an area of memory where parts of an application
>>>>>>>> can reside. First an enclave is created and loaded (from
>>>>>>>> non-enclave memory) with the code and data of an application,
>>>>>>>> then user space can map (mmap()) the enclave memory to
>>>>>>>> be able to enter the enclave at its defined entry points for
>>>>>>>> execution within it.
>>>>>>>>
>>>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
>>>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
>>>>>>>> its tracking of the enclave page permissions. When a page is loaded
>>>>>>>> into the enclave its permissions are specified and recorded in the
>>>>>>>> EPCM. In parallel the kernel maintains permissions within the
>>>>>>>> page table entries (PTEs) and the rule is that PTE permissions
>>>>>>>> are not allowed to be more relaxed than the EPCM permissions.
>>>>>>>>
>>>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
>>>>>>>> mapping has the same or weaker permissions than the permissions that
>>>>>>>> were vetted during enclave creation. This is enforced by
>>>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
>>>>>>>> paths. This rule remains.
>>>>>>>>
>>>>>>>> One feature of SGX2 is to support the modification of EPCM permissions
>>>>>>>> after enclave initialization. Enclave pages may thus already be part
>>>>>>>> of a VMA at the time their EPCM permissions are changed resulting
>>>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
>>>>>>>> permissions.
>>>>>>>>
>>>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
>>>>>>>> permissions in preparation for dynamic EPCM permission changes
>>>>>>>> made possible in SGX2.  New VMAs that attempt to have more relaxed
>>>>>>>> permissions than EPCM permissions continue to be unsupported.
>>>>>>>>
>>>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
>>>>>>>> than EPCM permissions instead of dynamically changing VMA permissions
>>>>>>>> when EPCM permissions change are:
>>>>>>>> 1) Changing VMA permissions involve splitting VMAs which is an
>>>>>>>>    operation that can fail. Additionally changing EPCM permissions of
>>>>>>>>    a range of pages could also fail on any of the pages involved.
>>>>>>>>    Handling these error cases causes problems. For example, if an
>>>>>>>>    EPCM permission change fails and the VMA has already been split
>>>>>>>>    then it is not possible to undo the VMA split nor possible to
>>>>>>>>    undo the EPCM permission changes that did succeed before the
>>>>>>>>    failure.
>>>>>>>> 2) The kernel has little insight into the user space where EPCM
>>>>>>>>    permissions are controlled from. For example, a RW page may
>>>>>>>>    be made RO just before it is made RX and splitting the VMAs
>>>>>>>>    while the VMAs may change soon is unnecessary.
>>>>>>>>
>>>>>>>> Remove the extra permission check called on a page fault
>>>>>>>> (vm_operations_struct->fault) or during debugging
>>>>>>>> (vm_operations_struct->access) when loading the enclave page from swap
>>>>>>>> that ensures that the VMA permissions are not more relaxed than the
>>>>>>>> EPCM permissions. Since a VMA could only exist if it passed the
>>>>>>>> original permission checks during mmap() and a VMA may indeed
>>>>>>>> have more relaxed permissions than the EPCM permissions this extra
>>>>>>>> permission check is no longer appropriate.
>>>>>>>>
>>>>>>>> With the permission check removed, ensure that PTEs do
>>>>>>>> not blindly inherit the VMA permissions but instead the permissions
>>>>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
>>>>>>>> and enclave perspective) are installed with the writable bit set,
>>>>>>>> reducing the need for this additional flow to the permission mismatch
>>>>>>>> cases handled next.
>>>>>>>>
>>>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>>>>>>>> ---
>>>>>>>> Changes since V1:
>>>>>>>> - Reword commit message (Jarkko).
>>>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
>>>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
>>>>>>>>   relationship between VMA, EPCM, and PTE permissions on SGX
>>>>>>>>   systems (Andy).
>>>>>>>>
>>>>>>>>  Documentation/x86/sgx.rst      | 10 +++++++++
>>>>>>>>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
>>>>>>>>  2 files changed, 30 insertions(+), 18 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
>>>>>>>> index 89ff924b1480..5659932728a5 100644
>>>>>>>> --- a/Documentation/x86/sgx.rst
>>>>>>>> +++ b/Documentation/x86/sgx.rst
>>>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
>>>>>>>>  * PTEs are installed to match the EPCM permissions, but not be more
>>>>>>>>    relaxed than the VMA permissions.
>>>>>>>>  
>>>>>>>> +On systems supporting SGX2 EPCM permissions may change while the
>>>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
>>>>>>>> +This means that a running VMA may appear to allow access to an enclave
>>>>>>>> +page that is not allowed by its EPCM permissions. For example, when an
>>>>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is
>>>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
>>>>>>>> +continues to maintain correct access to the enclave page through the
>>>>>>>> +PTE that will ensure that only access allowed by both the VMA
>>>>>>>> +and EPCM permissions are permitted.
>>>>>>>> +
>>>>>>>>  Application interface
>>>>>>>>  =====================
>>>>>>>>  
>>>>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> index 48afe96ae0f0..b6105d9e7c46 100644
>>>>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>>> -						unsigned long addr,
>>>>>>>> -						unsigned long vm_flags)
>>>>>>>> +						unsigned long addr)
>>>>>>>>  {
>>>>>>>> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>>>  	struct sgx_epc_page *epc_page;
>>>>>>>>  	struct sgx_encl_page *entry;
>>>>>>>>  
>>>>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>>>  	if (!entry)
>>>>>>>>  		return ERR_PTR(-EFAULT);
>>>>>>>>  
>>>>>>>> -	/*
>>>>>>>> -	 * Verify that the faulted page has equal or higher build time
>>>>>>>> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
>>>>>>>> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
>>>>>>>> -	 */
>>>>>>>> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
>>>>>>>> -		return ERR_PTR(-EFAULT);
>>>>>>>> -
>>>>>>>>  	/* Entry successfully located. */
>>>>>>>>  	if (entry->epc_page) {
>>>>>>>>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>>>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>  {
>>>>>>>>  	unsigned long addr = (unsigned long)vmf->address;
>>>>>>>>  	struct vm_area_struct *vma = vmf->vma;
>>>>>>>> +	unsigned long page_prot_bits;
>>>>>>>>  	struct sgx_encl_page *entry;
>>>>>>>> +	unsigned long vm_prot_bits;
>>>>>>>>  	unsigned long phys_addr;
>>>>>>>>  	struct sgx_encl *encl;
>>>>>>>>  	vm_fault_t ret;
>>>>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>  
>>>>>>>>  	mutex_lock(&encl->lock);
>>>>>>>>  
>>>>>>>> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
>>>>>>>> +	entry = sgx_encl_load_page(encl, addr);
>>>>>>>>  	if (IS_ERR(entry)) {
>>>>>>>>  		mutex_unlock(&encl->lock);
>>>>>>>   
>>>>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>>>  
>>>>>>>>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
>>>>>>>>  
>>>>>>>> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
>>>>>>>> +	/*
>>>>>>>> +	 * Insert PTE to match the EPCM page permissions ensured to not
>>>>>>>> +	 * exceed the VMA permissions.
>>>>>>>> +	 */
>>>>>>>> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>>> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
>>>>>>>> +	/*
>>>>>>>> +	 * Add VM_SHARED so that PTE is made writable right away if VMA
>>>>>>>> +	 * and EPCM are writable (no COW in SGX).
>>>>>>>> +	 */
>>>>>>>> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
>>>>>>>> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
>>>>>>>> +				  vm_get_page_prot(page_prot_bits));
>>>>>>>>  	if (ret != VM_FAULT_NOPAGE) {
>>>>>>>>  		mutex_unlock(&encl->lock);
>>>>>>>>  
>>>>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
>>>>>>>>   * Load an enclave page to EPC if required, and take encl->lock.
>>>>>>>>   */
>>>>>>>>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
>>>>>>>> -						   unsigned long addr,
>>>>>>>> -						   unsigned long vm_flags)
>>>>>>>> +						   unsigned long addr)
>>>>>>>>  {
>>>>>>>>  	struct sgx_encl_page *entry;
>>>>>>>>  
>>>>>>>>  	for ( ; ; ) {
>>>>>>>>  		mutex_lock(&encl->lock);
>>>>>>>>  
>>>>>>>> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
>>>>>>>> +		entry = sgx_encl_load_page(encl, addr);
>>>>>>>>  		if (PTR_ERR(entry) != -EBUSY)
>>>>>>>>  			break;
>>>>>>>>  
>>>>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>>>>>>>>  		return -EFAULT;
>>>>>>>>  
>>>>>>>>  	for (i = 0; i < len; i += cnt) {
>>>>>>>> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
>>>>>>>> -					      vma->vm_flags);
>>>>>>>> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
>>>>>>>>  		if (IS_ERR(entry)) {
>>>>>>>>  			ret = PTR_ERR(entry);
>>>>>>>>  			break;
>>>>>>>> -- 
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
>>>>>>> pags, you would not need to do this.
>>>>>>>
>>>>>>> These patches could be then safely dropped then:
>>>>>>>
>>>>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions 
>>>>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
>>>>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
>>>>>>>
>>>>>>> And that would also keep full ABI compatibility without exceptions to the
>>>>>>> existing mainline code.
>>>>>>>
>>>>>>
>>>>>> Dropping these changes do not just impact dynamically created pages. Dropping
>>>>>> these patches would result in EPCM page permission restriction being supported
>>>>>> for all pages, those added before enclave initialization as well as dynamically
>>>>>> added pages, but their PTEs will not be impacted.
>>>>>>
>>>>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
>>>>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
>>>>>> would keep allowing and installing RW PTEs to this page.
>>>>>
>>>>> I think that would be perfectly fine, if someone wants to do that. There is
>>>>> no corrateral damage on doing that. Kernel does not get messed because of
>>>>> that. It's a use case that does not make sense in the first place, so it'd
>>>>> be stupid to build anything extensive around it to the kernel.
>>>>>
>>>>> Shooting yourself to the foot is something that kernel does and should not
>>>>> protect user space from unless there is a risk of messing the state of the
>>>>> kernel itself.
>>>>>
>>>>> Much worse is that we have e.g. completely artificial ioctl
>>>>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS to support this scheme, which could e.g.
>>>>> cause extra roundtrips for simple EMODPE.
>>>>>
>>>>> Also this means not having to include 06/32, which keeps 100% backwards
>>>>> compatibility in run-time behaviour to the mainline while not restricting
>>>>> at all dynamically created pages. And we get rid of complex book keeping
>>>>> of vm_run_prot_bits.
>>>>>
>>>>> And generally the whole model is then very easy to understand and explain.
>>>>> If I had to keep presentation of the current mess in the patch set in a
>>>>> conference, I can honestly say that I would be in serious trouble. It's
>>>>> not clean and clear security model, which is a risk by itself.
>>>>
>>>> I.e.
>>>>
>>>> 1. For EADD'd pages: stick what has been the invariant 1,5 years now. Do
>>>>    not change it by any means (e.g. 06/32).
>>>> 2. For EAUG'd pages: set vm_max_prot_bits RWX, which essentially means do
>>>>    what ever you want with PTE's and EPCM.
>>>>
>>>> It's a clear and understandable model that does nothing bad to the kernel,
>>>> and a run-time developer can surely find away to get things on going. For
>>>> user space, the most important thing is the clarity in kernel behaviour,
>>>> and this does deliver that clarity. It's not perfect but it does do the
>>>> job and anyone can get it.
>>>
>>> Also a quantitive argument for this is that by simplifying security model
>>> this way it is one ioctl less, which must be considered as +1. We do not
>>> want to add new ioctls unless it is something we absolutely cannnot live
>>> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
>>>
>>
>> ok, with the implications understood and accepted I will proceed with a new
>> series that separates EPCM from PTEs and make RWX PTEs possible by default
>> for EAUG pages. This has broader impact than just removing
>> the three patches you list. "[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler
>> for present PTEs" is also no longer needed and there is no longer a need
>> to flush PTEs after restricting permissions. New changes also need to
>> be considered - at least the current documentation. I'll rework the series.
> 
> Yes, I really think it is a solid plan. Any possible LSM hooks would most
> likely attach to build product, not the dynamic behaviour.
> 
> As far as the page fault handler goes, Haitao is correct after the all
> discussions that it makes sense. The purpose of MAP_POPULATE series is
> not to replace it but instead complement it. Just wanted to clear this
> up as I said otherwise earlier this week.
> 

Understood. I will keep the implementation where EAUG is done in page fault
handler. I do plan to pick up your patch "x86/sgx: Export sgx_encl_page_alloc()"
since a consequence of the other changes is that this can now be shared.

Reinette