On Tue, 2021-11-09 at 12:00 -0800, Reinette Chatre wrote: > The SGX driver maintains a single global free page counter, > sgx_nr_free_pages, that reflects the number of free pages available > across all NUMA nodes. Correspondingly, a list of free pages is > associated with each NUMA node and sgx_nr_free_pages is updated > every time a page is added or removed from any of the free page > lists. The main usage of sgx_nr_free_pages is by the reclaimer > that will run when it (sgx_nr_free_pages) goes below a watermark > to ensure that there are always some free pages available to, for > example, support efficient page faults. > > With sgx_nr_free_pages accessed and modified from a few places > it is essential to ensure that these accesses are done safely but > this is not the case. sgx_nr_free_pages is read without any > protection and updated with inconsistent protection by any one > of the spin locks associated with the individual NUMA nodes. > For example: > > CPU_A CPU_B > ----- ----- > spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); > ... ... > sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--; > > spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); I don't understand the "NOT SAFE" part here. It is safe to access the variable, even when it is not atomic... I don't understand what the sequence above should tell me. > The consequence of sgx_nr_free_pages not being protected is that > its value may not accurately reflect the actual number of free > pages on the system, impacting the availability of free pages in > support of many flows. The problematic scenario isu when the In non-atomicity is not a problem, when it is not a problem :-) > reclaimer does not run because it believes there to be sufficient > free pages while any attempt to allocate a page fails because there > are no free pages available. > > The worst scenario observed was a user space hang because of > repeated page faults caused by no free pages made available. > > The following flow was encountered: > asm_exc_page_fault > ... > sgx_vma_fault() > sgx_encl_load_page() > sgx_encl_eldu() // Encrypted page needs to be loaded from backing > // storage into newly allocated SGX memory page > sgx_alloc_epc_page() // Allocate a page of SGX memory > __sgx_alloc_epc_page() // Fails, no free SGX memory > ... > if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer > wake_up(&ksgxd_waitq); > return -EBUSY; // Return -EBUSY giving reclaimer time to run > return -EBUSY; > return -EBUSY; > return VM_FAULT_NOPAGE; > > The reclaimer is triggered in above flow with the following code: > > static bool sgx_should_reclaim(unsigned long watermark) > { > return sgx_nr_free_pages < watermark && > !list_empty(&sgx_active_page_list); > } > > In the problematic scenario there were no free pages available yet the > value of sgx_nr_free_pages was above the watermark. The allocation of > SGX memory thus always failed because of a lack of free pages while no > free pages were made available because the reclaimer is never started > because of sgx_nr_free_pages' incorrect value. The consequence was that > user space kept encountering VM_FAULT_NOPAGE that caused the same > address to be accessed repeatedly with the same result. That causes sgx_should_reclaim() executed to be multiple times as the fault is retried. Eventually it should be successful. > Change the global free page counter to an atomic type that > ensures simultaneous updates are done safely. While doing so, move > the updating of the variable outside of the spin lock critical > section to which it does not belong. > > Cc: stable@xxxxxxxxxxxxxxx > Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") > Suggested-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> > Reviewed-by: Tony Luck <tony.luck@xxxxxxxxx> > Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx> I'm not yet sure if this a bug, even if it'd be a improvement to the performance. /Jarkko