Hi Jarkko,

On 3/8/2022 1:12 AM, Jarkko Sakkinen wrote:
> On Tue, Mar 08, 2022 at 11:06:46AM +0200, Jarkko Sakkinen wrote:
>> On Tue, Mar 08, 2022 at 10:14:42AM +0200, Jarkko Sakkinen wrote:
>>> On Mon, Mar 07, 2022 at 09:36:36AM -0800, Reinette Chatre wrote:
>>>> Hi Jarkko,
>>>>
>>>> On 3/7/2022 9:10 AM, Jarkko Sakkinen wrote:
>>>>> On Mon, Feb 07, 2022 at 04:45:28PM -0800, Reinette Chatre wrote:
>>>>>> === Summary ===
>>>>>>
>>>>>> An SGX VMA can only be created if its permissions are the same or
>>>>>> weaker than the Enclave Page Cache Map (EPCM) permissions. After VMA
>>>>>> creation this same rule is again enforced by the page fault handler:
>>>>>> faulted enclave pages are required to have equal or more relaxed
>>>>>> EPCM permissions than the VMA permissions.
>>>>>>
>>>>>> On SGX1 systems the additional enforcement in the page fault handler
>>>>>> is redundant and on SGX2 systems it incorrectly prevents access.
>>>>>> On SGX1 systems it is unnecessary to repeat the enforcement of the
>>>>>> permission rule. The rule used during original VMA creation will
>>>>>> ensure that any access attempt will use correct permissions.
>>>>>> With SGX2 the EPCM permissions of a page can change after VMA
>>>>>> creation resulting in the VMA permissions potentially being more
>>>>>> relaxed than the EPCM permissions and the page fault handler
>>>>>> incorrectly blocking valid access attempts.
>>>>>>
>>>>>> Enable the VMA's pages to remain accessible while ensuring that
>>>>>> the PTEs are installed to match the EPCM permissions but not be
>>>>>> more relaxed than the VMA permissions.
>>>>>>
>>>>>> === Full Changelog ===
>>>>>>
>>>>>> An SGX enclave is an area of memory where parts of an application
>>>>>> can reside. First an enclave is created and loaded (from
>>>>>> non-enclave memory) with the code and data of an application,
>>>>>> then user space can map (mmap()) the enclave memory to
>>>>>> be able to enter the enclave at its defined entry points for
>>>>>> execution within it.
>>>>>>
>>>>>> The hardware maintains a secure structure, the Enclave Page Cache Map
>>>>>> (EPCM), that tracks the contents of the enclave. Of interest here is
>>>>>> its tracking of the enclave page permissions. When a page is loaded
>>>>>> into the enclave its permissions are specified and recorded in the
>>>>>> EPCM. In parallel the kernel maintains permissions within the
>>>>>> page table entries (PTEs) and the rule is that PTE permissions
>>>>>> are not allowed to be more relaxed than the EPCM permissions.
>>>>>>
>>>>>> A new mapping (mmap()) of enclave memory can only succeed if the
>>>>>> mapping has the same or weaker permissions than the permissions that
>>>>>> were vetted during enclave creation. This is enforced by
>>>>>> sgx_encl_may_map() that is called on the mmap() as well as mprotect()
>>>>>> paths. This rule remains.
>>>>>>
>>>>>> One feature of SGX2 is to support the modification of EPCM permissions
>>>>>> after enclave initialization. Enclave pages may thus already be part
>>>>>> of a VMA at the time their EPCM permissions are changed resulting
>>>>>> in the VMA's permissions potentially being more relaxed than the EPCM
>>>>>> permissions.
>>>>>>
>>>>>> Allow permissions of existing VMAs to be more relaxed than EPCM
>>>>>> permissions in preparation for dynamic EPCM permission changes
>>>>>> made possible in SGX2. New VMAs that attempt to have more relaxed
>>>>>> permissions than EPCM permissions continue to be unsupported.
>>>>>>
>>>>>> Reasons why permissions of existing VMAs are allowed to be more relaxed
>>>>>> than EPCM permissions instead of dynamically changing VMA permissions
>>>>>> when EPCM permissions change are:
>>>>>> 1) Changing VMA permissions involve splitting VMAs which is an
>>>>>>    operation that can fail. Additionally changing EPCM permissions of
>>>>>>    a range of pages could also fail on any of the pages involved.
>>>>>>    Handling these error cases causes problems. For example, if an
>>>>>>    EPCM permission change fails and the VMA has already been split
>>>>>>    then it is not possible to undo the VMA split nor possible to
>>>>>>    undo the EPCM permission changes that did succeed before the
>>>>>>    failure.
>>>>>> 2) The kernel has little insight into the user space where EPCM
>>>>>>    permissions are controlled from. For example, a RW page may
>>>>>>    be made RO just before it is made RX and splitting the VMAs
>>>>>>    while the VMAs may change soon is unnecessary.
>>>>>>
>>>>>> Remove the extra permission check called on a page fault
>>>>>> (vm_operations_struct->fault) or during debugging
>>>>>> (vm_operations_struct->access) when loading the enclave page from swap
>>>>>> that ensures that the VMA permissions are not more relaxed than the
>>>>>> EPCM permissions. Since a VMA could only exist if it passed the
>>>>>> original permission checks during mmap() and a VMA may indeed
>>>>>> have more relaxed permissions than the EPCM permissions this extra
>>>>>> permission check is no longer appropriate.
>>>>>>
>>>>>> With the permission check removed, ensure that PTEs do
>>>>>> not blindly inherit the VMA permissions but instead the permissions
>>>>>> that the VMA and EPCM agree on. PTEs for writable pages (from VMA
>>>>>> and enclave perspective) are installed with the writable bit set,
>>>>>> reducing the need for this additional flow to the permission mismatch
>>>>>> cases handled next.
>>>>>>
>>>>>> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>>>>>> ---
>>>>>> Changes since V1:
>>>>>> - Reword commit message (Jarkko).
>>>>>> - Use "relax" instead of "exceed" when referring to permissions (Dave).
>>>>>> - Add snippet to Documentation/x86/sgx.rst that highlights the
>>>>>>   relationship between VMA, EPCM, and PTE permissions on SGX
>>>>>>   systems (Andy).
>>>>>>
>>>>>>  Documentation/x86/sgx.rst      | 10 +++++++++
>>>>>>  arch/x86/kernel/cpu/sgx/encl.c | 38 ++++++++++++++++++----------------
>>>>>>  2 files changed, 30 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/x86/sgx.rst b/Documentation/x86/sgx.rst
>>>>>> index 89ff924b1480..5659932728a5 100644
>>>>>> --- a/Documentation/x86/sgx.rst
>>>>>> +++ b/Documentation/x86/sgx.rst
>>>>>> @@ -99,6 +99,16 @@ The relationships between the different permission masks are:
>>>>>>  * PTEs are installed to match the EPCM permissions, but not be more
>>>>>>    relaxed than the VMA permissions.
>>>>>>
>>>>>> +On systems supporting SGX2 EPCM permissions may change while the
>>>>>> +enclave page belongs to a VMA without impacting the VMA permissions.
>>>>>> +This means that a running VMA may appear to allow access to an enclave
>>>>>> +page that is not allowed by its EPCM permissions. For example, when an
>>>>>> +enclave page with RW EPCM permissions is mapped by a RW VMA but is
>>>>>> +subsequently changed to have read-only EPCM permissions. The kernel
>>>>>> +continues to maintain correct access to the enclave page through the
>>>>>> +PTE that will ensure that only access allowed by both the VMA
>>>>>> +and EPCM permissions are permitted.
>>>>>> +
>>>>>>  Application interface
>>>>>>  =====================
>>>>>>
>>>>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>> index 48afe96ae0f0..b6105d9e7c46 100644
>>>>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>>>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>>>>> @@ -91,10 +91,8 @@ static struct sgx_epc_page *sgx_encl_eldu(struct sgx_encl_page *encl_page,
>>>>>>  }
>>>>>>
>>>>>>  static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>> -						unsigned long addr,
>>>>>> -						unsigned long vm_flags)
>>>>>> +						unsigned long addr)
>>>>>>  {
>>>>>> -	unsigned long vm_prot_bits = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>>  	struct sgx_epc_page *epc_page;
>>>>>>  	struct sgx_encl_page *entry;
>>>>>>
>>>>>> @@ -102,14 +100,6 @@ static struct sgx_encl_page *sgx_encl_load_page(struct sgx_encl *encl,
>>>>>>  	if (!entry)
>>>>>>  		return ERR_PTR(-EFAULT);
>>>>>>
>>>>>> -	/*
>>>>>> -	 * Verify that the faulted page has equal or higher build time
>>>>>> -	 * permissions than the VMA permissions (i.e. the subset of {VM_READ,
>>>>>> -	 * VM_WRITE, VM_EXECUTE} in vma->vm_flags).
>>>>>> -	 */
>>>>>> -	if ((entry->vm_max_prot_bits & vm_prot_bits) != vm_prot_bits)
>>>>>> -		return ERR_PTR(-EFAULT);
>>>>>> -
>>>>>>  	/* Entry successfully located. */
>>>>>>  	if (entry->epc_page) {
>>>>>>  		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>>>>> @@ -138,7 +128,9 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>  {
>>>>>>  	unsigned long addr = (unsigned long)vmf->address;
>>>>>>  	struct vm_area_struct *vma = vmf->vma;
>>>>>> +	unsigned long page_prot_bits;
>>>>>>  	struct sgx_encl_page *entry;
>>>>>> +	unsigned long vm_prot_bits;
>>>>>>  	unsigned long phys_addr;
>>>>>>  	struct sgx_encl *encl;
>>>>>>  	vm_fault_t ret;
>>>>>> @@ -155,7 +147,7 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>
>>>>>>  	mutex_lock(&encl->lock);
>>>>>>
>>>>>> -	entry = sgx_encl_load_page(encl, addr, vma->vm_flags);
>>>>>> +	entry = sgx_encl_load_page(encl, addr);
>>>>>>  	if (IS_ERR(entry)) {
>>>>>>  		mutex_unlock(&encl->lock);
>>>>>
>>>>>> @@ -167,7 +159,19 @@ static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
>>>>>>
>>>>>>  	phys_addr = sgx_get_epc_phys_addr(entry->epc_page);
>>>>>>
>>>>>> -	ret = vmf_insert_pfn(vma, addr, PFN_DOWN(phys_addr));
>>>>>> +	/*
>>>>>> +	 * Insert PTE to match the EPCM page permissions ensured to not
>>>>>> +	 * exceed the VMA permissions.
>>>>>> +	 */
>>>>>> +	vm_prot_bits = vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
>>>>>> +	page_prot_bits = entry->vm_max_prot_bits & vm_prot_bits;
>>>>>> +	/*
>>>>>> +	 * Add VM_SHARED so that PTE is made writable right away if VMA
>>>>>> +	 * and EPCM are writable (no COW in SGX).
>>>>>> +	 */
>>>>>> +	page_prot_bits |= (vma->vm_flags & VM_SHARED);
>>>>>> +	ret = vmf_insert_pfn_prot(vma, addr, PFN_DOWN(phys_addr),
>>>>>> +				  vm_get_page_prot(page_prot_bits));
>>>>>>  	if (ret != VM_FAULT_NOPAGE) {
>>>>>>  		mutex_unlock(&encl->lock);
>>>>>>
>>>>>> @@ -295,15 +299,14 @@ static int sgx_encl_debug_write(struct sgx_encl *encl, struct sgx_encl_page *pag
>>>>>>   * Load an enclave page to EPC if required, and take encl->lock.
>>>>>>   */
>>>>>>  static struct sgx_encl_page *sgx_encl_reserve_page(struct sgx_encl *encl,
>>>>>> -						   unsigned long addr,
>>>>>> -						   unsigned long vm_flags)
>>>>>> +						   unsigned long addr)
>>>>>>  {
>>>>>>  	struct sgx_encl_page *entry;
>>>>>>
>>>>>>  	for ( ; ; ) {
>>>>>>  		mutex_lock(&encl->lock);
>>>>>>
>>>>>> -		entry = sgx_encl_load_page(encl, addr, vm_flags);
>>>>>> +		entry = sgx_encl_load_page(encl, addr);
>>>>>>  		if (PTR_ERR(entry) != -EBUSY)
>>>>>>  			break;
>>>>>>
>>>>>> @@ -339,8 +342,7 @@ static int sgx_vma_access(struct vm_area_struct *vma, unsigned long addr,
>>>>>>  		return -EFAULT;
>>>>>>
>>>>>>  	for (i = 0; i < len; i += cnt) {
>>>>>> -		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK,
>>>>>> -					      vma->vm_flags);
>>>>>> +		entry = sgx_encl_reserve_page(encl, (addr + i) & PAGE_MASK);
>>>>>>  		if (IS_ERR(entry)) {
>>>>>>  			ret = PTR_ERR(entry);
>>>>>>  			break;
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>>
>>>>> If you unconditionally set vm_max_prot_bits to RWX for dynamically created
>>>>> pages, you would not need to do this.
>>>>>
>>>>> These patches could then be safely dropped:
>>>>>
>>>>> - [PATCH V2 06/32] x86/sgx: Support VMA permissions more relaxed than enclave permissions
>>>>> - [PATCH V2 08/32] x86/sgx: x86/sgx: Add sgx_encl_page->vm_run_prot_bits for dynamic permission changes
>>>>> - [PATCH V2 15/32] x86/sgx: Support relaxing of enclave page permissions
>>>>>
>>>>> And that would also keep full ABI compatibility without exceptions to the
>>>>> existing mainline code.
>>>>>
>>>>
>>>> Dropping these changes does not just impact dynamically created pages.
>>>> Dropping these patches would result in EPCM page permission restriction
>>>> being supported for all pages, those added before enclave initialization
>>>> as well as dynamically added pages, but their PTEs will not be impacted.
>>>>
>>>> For example, if a RW enclave page is added via SGX_IOC_ENCLAVE_ADD_PAGES and
>>>> then later made read-only via SGX_IOC_ENCLAVE_RESTRICT_PERMISSIONS then Linux
>>>> would keep allowing and installing RW PTEs to this page.
>>>
>>> I think that would be perfectly fine, if someone wants to do that. There is
>>> no collateral damage in doing that. The kernel does not get messed up
>>> because of that. It's a use case that does not make sense in the first
>>> place, so it'd be stupid to build anything extensive around it in the
>>> kernel.
>>>
>>> Shooting yourself in the foot is something that the kernel does not and
>>> should not protect user space from, unless there is a risk of messing up
>>> the state of the kernel itself.
>>>
>>> Much worse is that we have e.g. a completely artificial ioctl,
>>> SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, to support this scheme, which could e.g.
>>> cause extra roundtrips for a simple EMODPE.
>>>
>>> Also this means not having to include 06/32, which keeps 100% backwards
>>> compatibility in run-time behaviour with the mainline while not restricting
>>> dynamically created pages at all. And we get rid of the complex bookkeeping
>>> of vm_run_prot_bits.
>>>
>>> And generally the whole model is then very easy to understand and explain.
>>> If I had to give a presentation on the current mess in the patch set at a
>>> conference, I can honestly say that I would be in serious trouble. It's
>>> not a clean and clear security model, which is a risk by itself.
>>
>> I.e.
>>
>> 1. For EADD'd pages: stick with what has been the invariant for 1.5 years
>> now. Do not change it by any means (e.g. 06/32).
>> 2. For EAUG'd pages: set vm_max_prot_bits to RWX, which essentially means
>> do whatever you want with the PTEs and EPCM.
>>
>> It's a clear and understandable model that does nothing bad to the kernel,
>> and a run-time developer can surely find a way to get things going. For
>> user space, the most important thing is the clarity in kernel behaviour,
>> and this does deliver that clarity. It's not perfect but it does do the
>> job and anyone can get it.
>
> Also a quantitative argument for this is that by simplifying the security
> model this way it is one ioctl less, which must be considered a +1. We do
> not want to add new ioctls unless it is something we absolutely cannot live
> without. We absolutely can live without SGX_IOC_ENCLAVE_RELAX_PERMISSIONS.
>

OK, with the implications understood and accepted, I will proceed with a new
series that separates EPCM permissions from PTE permissions and makes RWX
PTEs possible by default for EAUG pages.

This has broader impact than just removing the three patches you list.
"[PATCH 07/32] x86/sgx: Add pfn_mkwrite() handler for present PTEs" is also
no longer needed, and there is no longer a need to flush PTEs after
restricting permissions. New changes also need to be considered, at least to
the current documentation.

I'll rework the series.

Reinette
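For reference, a minimal sketch of what the model above could look like for
dynamically added pages, assuming a hypothetical helper named
sgx_encl_page_alloc_dynamic() written in the style of the existing
sgx_encl_page_alloc(); only the current struct sgx_encl_page fields (desc,
encl, vm_max_prot_bits) are assumed, and this is not code from the posted
series:

	/*
	 * Illustrative sketch only: track an EAUG'd page with
	 * vm_max_prot_bits set to RWX so that the kernel never has to
	 * adjust VMAs or PTEs when the enclave changes EPCM permissions.
	 */
	static struct sgx_encl_page *
	sgx_encl_page_alloc_dynamic(struct sgx_encl *encl, unsigned long offset)
	{
		struct sgx_encl_page *encl_page;

		encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL);
		if (!encl_page)
			return ERR_PTR(-ENOMEM);

		/* The page is tracked by its address within the enclave. */
		encl_page->desc = encl->base + offset;
		encl_page->encl = encl;

		/*
		 * Allow any VMA and PTE permissions for dynamically added
		 * pages. The enclave manages its own EPCM permissions via
		 * EMODPE/EACCEPT, so mmap()/mprotect() checks against
		 * vm_max_prot_bits never reject a mapping of this page and
		 * the fault handler can install PTEs that follow the VMA.
		 */
		encl_page->vm_max_prot_bits = VM_READ | VM_WRITE | VM_EXEC;

		return encl_page;
	}

With vm_max_prot_bits fixed at RWX for EAUG'd pages there is no need for
vm_run_prot_bits bookkeeping or for SGX_IOC_ENCLAVE_RELAX_PERMISSIONS, while
EADD'd pages keep the existing mainline behaviour unchanged.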