Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

Reinette Chatre <reinette.chatre@xxxxxxxxx> · Fri, 10 May 2024 16:47:50 -0700

Hi Dmitrii,

Thank you very much for uncovering and fixing this issue.

On 4/30/2024 7:38 AM, Dmitrii Kuvaiskii wrote:
> On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
>> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
>>> Two enclave threads may try to add and remove the same enclave page
>>> simultaneously (e.g., if the SGX runtime supports both lazy allocation
>>> and `MADV_DONTNEED` semantics). Consider this race:
>>>
>>> 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
>>>    after removing the page table entry and right before re-acquiring the
>>>    enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
>>> 2. T2 tries to access the page, and #PF[not_present] is raised. The
>>>    condition to EAUG in sgx_vma_fault() is not satisfied because the
>>>    page is still present in encl->page_array, thus the SGX driver
>>>    assumes that the fault happened because the page was swapped out. The
>>>    driver continues on a code path that installs a page table entry
>>>    *without* performing EAUG.
>>> 3. The enclave page metadata is in an inconsistent state: the PTE is
>>>    installed but there was no EAUG. Thus, T2 in userspace infinitely
>>>    receives SIGSEGV on this page (and EACCEPT always fails).
>>>
>>> Fix this by making sure that T1 (the page-removing thread) always wins
>>> this data race. In particular, the page-being-removed is marked as such,
>>> and T2 retries until the page is fully removed.
>>>
>>> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@xxxxxxxxx>
>>> ---
>>>  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
>>>  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
>>>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>>>  3 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>> index 41f14b1a3025..7ccd8b2fce5f 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>>>  
>>>  	/* Entry successfully located. */
>>>  	if (entry->epc_page) {
>>> -		if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>> +		if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
>>> +				   SGX_ENCL_PAGE_BEING_REMOVED))
>>>  			return ERR_PTR(-EBUSY);
>>>  
>>>  		return entry;
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
>>> index f94ff14c9486..fff5f2293ae7 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.h
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.h
>>> @@ -25,6 +25,9 @@
>>>  /* 'desc' bit marking that the page is being reclaimed. */
>>>  #define SGX_ENCL_PAGE_BEING_RECLAIMED	BIT(3)
>>>  
>>> +/* 'desc' bit marking that the page is being removed. */
>>> +#define SGX_ENCL_PAGE_BEING_REMOVED	BIT(2)
>>> +
>>>  struct sgx_encl_page {
>>>  	unsigned long desc;
>>>  	unsigned long vm_max_prot_bits:8;
>>> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
>>> index b65ab214bdf5..c542d4dd3e64 100644
>>> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
>>> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>>>  		 * Do not keep encl->lock because of dependency on
>>>  		 * mmap_lock acquired in sgx_zap_enclave_ptes().
>>>  		 */
>>> +		entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>>>  		mutex_unlock(&encl->lock);
>>>  
>>>  		sgx_zap_enclave_ptes(encl, addr);
>>
>> It is somewhat trivial to NAK this, as the commit message makes
>> no effort to describe the new flag. By default, at least, I am
>> strongly opposed to any new flags related to reclaiming, even if
>> that needs a bit of extra synchronization work in user space.
>>
>> One way to describe concurrency scenarios would be to take
>> an example from https://www.kernel.org/doc/Documentation/memory-barriers.txt
>>
>> I.e. see the examples with CPU 1 and CPU 2.
> 
> Thank you for the suggestion. Here is my new attempt at describing the racy
> scenario:
> 
> Consider some enclave page added to the enclave. User space decides to
> temporarily remove this page (e.g., emulating the MADV_DONTNEED semantics)
> on CPU1. At the same time, user space performs a memory access on the same
> page on CPU2, which results in a #PF and ultimately in sgx_vma_fault().
> The scenario proceeds as follows:
> 
> /*
>  * CPU1: User space performs
>  * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
>  * on a single enclave page
>  */
> sgx_encl_remove_pages() {
> 
>   mutex_lock(&encl->lock);
> 
>   entry = sgx_encl_load_page(encl);
>   /*
>    * verify that page is
>    * trimmed and accepted
>    */
> 
>   mutex_unlock(&encl->lock);
> 
>   /*
>    * remove PTE entry; cannot
>    * be performed under lock
>    */
>   sgx_zap_enclave_ptes(encl);
>                                    /*
>                                     * Fault on CPU2
>                                     */

Please highlight that this fault is related to the page that
is in the process of being removed on CPU1.

>                                    sgx_vma_fault() {
>                                      /*
>                                       * PTE entry was removed, but the
>                                       * page is still in enclave's xarray
>                                       */
>                                      xa_load(&encl->page_array) != NULL ->
>                                      /*
>                                       * SGX driver thinks that this page
>                                       * was swapped out and loads it
>                                       */
>                                      mutex_lock(&encl->lock);
>                                      /*
>                                       * this is effectively a no-op
>                                       */
>                                      entry = sgx_encl_load_page_in_vma();
>                                      /*
>                                       * add PTE entry
>                                       */

It may be helpful to highlight that this is a problem: "BUG: A PTE
is installed for a page in the process of being removed." (Please feel
free to expand.)

>                                      vmf_insert_pfn(...);
> 
>                                      mutex_unlock(&encl->lock);
>                                      return VM_FAULT_NOPAGE;
>                                    }
>   /*
>    * continue with page removal
>    */
>   mutex_lock(&encl->lock);
> 
>   sgx_encl_free_epc_page(epc_page) {
>     /*
>      * remove page via EREMOVE
>      */
>     /*
>      * free EPC page
>      */
>     sgx_free_epc_page(epc_page);
>   }
> 
>   xa_erase(&encl->page_array);
> 
>   mutex_unlock(&encl->lock);
> }
> 
> CPU1 removed the page. However, CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed. Because of this
> combination, any subsequent access to this page raises a fault, and the #PF
> handler sees the SGX bit set in the #PF error code and does not call

Which #PF handler?

> sgx_vma_fault() but instead raises a SIGSEGV. The userspace SIGSEGV handler
> cannot perform EACCEPT because the page was not EAUGed. Thus, user
> space is stuck with the inaccessible page.
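
For anyone following along, the interleaving described above can be
replayed deterministically in a small user-space model. This is only a
sketch under simplifying assumptions: struct race_page and the race_*()
helpers are hypothetical names made up for illustration, the three
booleans stand in for the xarray entry, the EPC backing and the PTE,
and no real locking or SGX instructions are involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one enclave page; not the kernel's sgx_encl_page. */
struct race_page {
	bool in_page_array; /* entry still stored in encl->page_array */
	bool epc_backed;    /* EPC page attached, i.e. not yet EREMOVEd */
	bool pte_present;   /* a PTE currently maps the page */
};

/* CPU1, first half: encl->lock is dropped and the PTE is zapped. */
static void race_remove_begin(struct race_page *p)
{
	p->pte_present = false;
}

/* CPU2: unfixed fault path. The entry is found, so a PTE is installed. */
static void race_fault(struct race_page *p)
{
	if (p->in_page_array && p->epc_backed)
		p->pte_present = true;
}

/* CPU1, second half: EREMOVE, free the EPC page, erase the entry. */
static void race_remove_finish(struct race_page *p)
{
	assert(p->in_page_array);
	p->epc_backed = false;
	p->in_page_array = false;
}

/* Replays CPU1 / CPU2 / CPU1 and reports whether a stale PTE survives. */
static bool race_leaves_stale_pte(void)
{
	struct race_page p = { .in_page_array = true, .epc_backed = true,
			       .pte_present = true };

	race_remove_begin(&p);
	race_fault(&p);
	race_remove_finish(&p);
	return p.pte_present && !p.in_page_array;
}
```

In this model race_leaves_stale_pte() returns true: the PTE installed
by CPU2 outlives the page it was meant to map, which corresponds to the
stuck state described in the quoted text.
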
> 
> This race can be fixed by forcing the fault handler on CPU2 to back off if
> the page is currently being removed (on CPU1). Thus a simple change is to
> introduce a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by default
> and set only right before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is being removed, and if so, CPU2 backs off and waits until the
> page is completely removed. After that, any memory access to this page
> results in a normal "allocate and EAUG a page on #PF" flow.
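
To make the proposed back-off concrete, here is a minimal user-space
sketch of the same ordering with the flag in place. It is not the
driver code: struct guarded_page, enum guarded_result and the
guarded_*() helpers are hypothetical, MODEL_PAGE_BEING_REMOVED merely
mirrors the BIT(2) value from the patch, and the "retry" result stands
in for the -EBUSY path in __sgx_encl_load_page().

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors SGX_ENCL_PAGE_BEING_REMOVED (BIT(2)) from the patch. */
#define MODEL_PAGE_BEING_REMOVED (1UL << 2)

/* Hypothetical model of one enclave page; not the kernel's sgx_encl_page. */
struct guarded_page {
	unsigned long desc; /* flag bits only in this model */
	bool in_page_array; /* entry still stored in encl->page_array */
	bool pte_present;   /* a PTE currently maps the page */
};

enum guarded_result {
	GUARDED_RETRY,        /* page is being removed, fault must be retried */
	GUARDED_MAP_EXISTING, /* resident page, PTE reinstalled */
	GUARDED_EAUG_NEW,     /* no entry, page is allocated and EAUGed */
};

/* CPU1, first half: mark the page under the lock, then zap the PTE. */
static void guarded_remove_begin(struct guarded_page *p)
{
	p->desc |= MODEL_PAGE_BEING_REMOVED;
	p->pte_present = false;
}

/* CPU1, second half: EREMOVE and erase the entry (flag dies with it). */
static void guarded_remove_finish(struct guarded_page *p)
{
	assert(p->desc & MODEL_PAGE_BEING_REMOVED);
	p->desc = 0;
	p->in_page_array = false;
}

/* CPU2: fault path that backs off while the removal is in flight. */
static enum guarded_result guarded_fault(struct guarded_page *p)
{
	if (!p->in_page_array) {
		/* Normal "allocate and EAUG a page on #PF" flow. */
		p->in_page_array = true;
		p->pte_present = true;
		return GUARDED_EAUG_NEW;
	}
	if (p->desc & MODEL_PAGE_BEING_REMOVED)
		return GUARDED_RETRY; /* no PTE is installed */

	p->pte_present = true;
	return GUARDED_MAP_EXISTING;
}
```

A fault that arrives between guarded_remove_begin() and
guarded_remove_finish() gets GUARDED_RETRY and leaves the PTE absent; a
fault after guarded_remove_finish() takes the EAUG path, matching the
behavior described in the quoted paragraph.
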

I have been tripped up by these page flags before, so I would appreciate
another opinion. From my side, this looks like an appropriate fix.

Reinette