Hi Greg,
On 11/4/2021 11:54 AM, Greg KH wrote:
> On Thu, Nov 04, 2021 at 11:28:54AM -0700, Reinette Chatre wrote:
>> The SGX driver maintains a single global free page counter,
>> sgx_nr_free_pages, that reflects the number of free pages available
>> across all NUMA nodes. Correspondingly, a list of free pages is
>> associated with each NUMA node and sgx_nr_free_pages is updated
>> every time a page is added or removed from any of the free page
>> lists. The main usage of sgx_nr_free_pages is by the reclaimer
>> that will run when the total free pages go below a watermark to
>> ensure that there are always some free pages available to, for
>> example, support efficient page faults.
>>
>> With sgx_nr_free_pages accessed and modified from a few places
>> it is essential to ensure that these accesses are done safely but
>> this is not the case. sgx_nr_free_pages is sometimes accessed
>> without any protection and when it is protected it is done
>> inconsistently with any one of the spin locks associated with the
>> individual NUMA nodes.
>>
>> The consequence of sgx_nr_free_pages not being protected is that
>> its value may not accurately reflect the actual number of free
>> pages on the system, impacting the availability of free pages in
>> support of many flows. The problematic scenario is when the
>> reclaimer never runs because it believes there to be sufficient
>> free pages while any attempt to allocate a page fails because there
>> are no free pages available. The worst scenario observed was a
>> user space hang because of repeated page faults caused by
>> no free pages ever made available.
>>
>> Change the global free page counter to an atomic type that
>> ensures simultaneous updates are done safely. While doing so, move
>> the updating of the variable outside of the spin lock critical
>> section to which it does not belong.
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()")
>> Suggested-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>> Signed-off-by: Reinette Chatre <reinette.chatre@xxxxxxxxx>
>> ---
>>  arch/x86/kernel/cpu/sgx/main.c | 12 ++++++------
>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
>> index 63d3de02bbcc..8558d7d5f3e7 100644
>> --- a/arch/x86/kernel/cpu/sgx/main.c
>> +++ b/arch/x86/kernel/cpu/sgx/main.c
>> @@ -28,8 +28,7 @@ static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
>>  static LIST_HEAD(sgx_active_page_list);
>>  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
>>
>> -/* The free page list lock protected variables prepend the lock. */
>> -static unsigned long sgx_nr_free_pages;
>> +atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
>>
>>  /* Nodes with one or more EPC sections. */
>>  static nodemask_t sgx_numa_mask;
>>
>> @@ -403,14 +402,15 @@ static void sgx_reclaim_pages(void)
>>
>>  		spin_lock(&node->lock);
>>  		list_add_tail(&epc_page->list, &node->free_page_list);
>> -		sgx_nr_free_pages++;
>>  		spin_unlock(&node->lock);
>> +		atomic_long_inc(&sgx_nr_free_pages);
>>  	}
>>  }
>>
>>  static bool sgx_should_reclaim(unsigned long watermark)
>>  {
>> -	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
>> +	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>> +	       !list_empty(&sgx_active_page_list);
Thank you very much for taking a look.
> What prevents the value from changing right after you test this?
You are correct. It is indeed possible for the value to change after
this test. This test only decides when to reclaim more pages, so an
absolutely accurate value is not required; what is required is knowing
that the number of free pages is running low.
> Why is an atomic value somehow solving the problem?
During stress testing it was found that page allocation in a hot path
(for example, the page fault handler) fails because there are no free
pages available in any of the free page lists, while the global counter
contains a value that does not reflect the actual number of free pages
and is higher than the watermark, so the reclaimer never runs.
Changing the counter to an atomic fixed this issue. I also reverted to
how this counter was managed before, with a spin lock protected free
counter per free list, and that also fixes the issue.
> The value changes were happening safely, it was just the reading of the
> value that was not. You have not changed the fact that the value can
> change right after reading given that there was not going to be a
> problem with reading a stale value before.
I am testing on a two socket system and I am seeing that the value of
sgx_nr_free_pages does not accurately reflect the actual number of free
pages. It does not look to me like the value is updated safely, since it
is updated with inconsistent protection on a system like this. There is
a spin lock associated with each NUMA node, and which lock is used to
update the variable depends on which NUMA node's memory is being
modified - the variable is not always protected by the same lock:

	spin_lock(&node->lock);
	sgx_nr_free_pages++;
	spin_unlock(&node->lock);
> In other words, what did you really fix here? And how did you test it
> to verify it did fix things?
To test this I created a stress test that builds on top of the new
"oversubscription" test case that we are trying to add to the SGX
kselftests:
https://lore.kernel.org/lkml/7715db4882ab9fd52d21de6f62bb3b7e94dc4885.1635447301.git.reinette.chatre@xxxxxxxxx/
In the changed test an enclave is created with as much heap as there is
SGX memory. After that all the pages are accessed and their type is
changed, then the enclave is entered to run EACCEPT on each page, and
after that the pages are removed (EREMOVE).
This test places significant stress on the SGX memory allocation and
reclaim subsystems. The troublesome part of the test is when the enclave
is entered so that EACCEPT can be run on each page. During this time,
because of the oversubscription and previous accesses, there are many
page faults. Each fault requires a new page to be allocated in SGX
memory into which the faulted page can be decrypted and loaded. At this
point the test hangs.
Below I show the following:
* perf top output showing that the test hangs because user space is
  stuck just encountering page faults
* code traces explaining why the repeated page faults occur (because the
  reclaimer never runs)
* perf top output after this patch is applied, showing that the
  reclaimer now gets a chance to run and the test can complete
Here is the perf top trace before this patch is applied showing how user
space is stuck hitting page faults over and over:
   PerfTop: 4569 irqs/sec  kernel:25.0%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 224 CPUs)
------------------------------------------------------------------------------

    94.34% 68.51%  [vdso]    [.] __vdso_sgx_enter_enclave
            |
            |--68.51%--0x5541f689495641d7
            |          __libc_start_main
            |          main
            |          test_harness_run
            |          __run_test
            |          wrapper_enclave_unclobbered_vdso_oversubscribed_remove
            |          enclave_unclobbered_vdso_oversubscribed_remove
            |          __vdso_sgx_enter_enclave
            |          |
            |          --68.30%--asm_exc_page_fault
            |
            --7.84%--__vdso_sgx_enter_enclave
                     |
                     --2.58%--asm_exc_page_fault
                              |
                              --2.72%--exc_page_fault
                                       |
                                       --3.67%--do_user_addr_fault
                                                |
                                                --6.96%--handle_mm_fault
                                                         |
                                                         |--4.37%--__handle_mm_fault
                                                         |
                                                         --1.65%--__do_fault
                                                                  |
                                                                  --2.66%--sgx_vma_fault
                                                                           |
                                                                           |--1.93%--xa_load
                                                                           |         |
                                                                           |         --1.85%--xas_load
                                                                           |
                                                                           --1.21%--sgx_encl_load_page
                                                                                    |
                                                                                    --1.35%--__count_memcg_events

    85.55%  0.17%  [kernel]  [k] asm_exc_page_fault
            |
            --70.81%--asm_exc_page_fault
                      |
                      --2.73%--exc_page_fault
                               |
                               --3.71%--do_user_addr_fault
                                        |
                                        --7.00%--handle_mm_fault
                                                 |
                                                 |--4.42%--__handle_mm_fault
                                                 |
                                                 --1.65%--__do_fault
                                                          |
                                                          --2.66%--sgx_vma_fault
                                                                   |
                                                                   |--1.93%--xa_load
                                                                   |         |
                                                                   |         --1.85%--xas_load
                                                                   |
                                                                   --1.21%--sgx_encl_load_page
                                                                            |
                                                                            --1.35%--__count_memcg_events

    16.83%  0.18%  [kernel]  [k] exc_page_fault
            |
            --2.57%--exc_page_fault
                     |
                     --3.69%--do_user_addr_fault
                              |
                              --7.00%--handle_mm_fault
                                       |
                                       |--4.42%--__handle_mm_fault
                                       |
                                       --1.65%--__do_fault
                                                |
                                                --2.66%--sgx_vma_fault
                                                         |
                                                         |--1.93%--xa_load
                                                         |         |
                                                         |         --1.85%--xas_load
                                                         |
                                                         --1.21%--sgx_encl_load_page
                                                                  |
                                                                  --1.35%--__count_memcg_events
exiting.
What happens above is that the following flow is encountered:

sgx_vma_fault()
  sgx_encl_load_page()
    sgx_encl_eldu() // page needs to be loaded from swap
      sgx_alloc_epc_page(..., false) // new EPC page is needed
Taking a closer look at sgx_alloc_epc_page():

struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
{
	struct sgx_epc_page *page;

	for ( ; ; ) {
		page = __sgx_alloc_epc_page(); <--- page == NULL because of no free pages
		if (!IS_ERR(page)) {
			page->owner = owner;
			break;
		}

		if (list_empty(&sgx_active_page_list)) <--- there are plenty of pages that can be reclaimed
			return ERR_PTR(-ENOMEM);

		if (!reclaim) { <--- reclaim is false so EBUSY will be returned, but an attempt is made to wake the reclaimer below
			page = ERR_PTR(-EBUSY);
			break;
		}

		if (signal_pending(current)) {
			page = ERR_PTR(-ERESTARTSYS);
			break;
		}

		sgx_reclaim_pages();
		cond_resched();
	}

	if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) <--- expected to wake up the reclaimer but this is not happening
		wake_up(&ksgxd_waitq);

	return page;
}
Because the above returns EBUSY, sgx_vma_fault() will return
VM_FAULT_NOPAGE and user space will keep attempting to access the same
page, triggering the same fault again because there are still no free
pages:

static vm_fault_t sgx_vma_fault(struct vm_fault *vmf)
{
	...
	entry = sgx_encl_load_page(encl, addr);
	if (IS_ERR(entry)) {
		mutex_unlock(&encl->lock);

		if (PTR_ERR(entry) == -EBUSY)
			return VM_FAULT_NOPAGE;
	}
	...
}
I added some tracing and it shows that the value of sgx_nr_free_pages
was higher than the watermark, so the reclaimer does not free up any
pages, yet memory allocation keeps failing because there are no free
pages left.
With this patch the test is able to complete and the tracing shows that
the reclaimer is getting a chance to run. Previously the system was
spending almost all its time in page faults but now the time is split
between the page faults and the reclaimer.
   PerfTop: 7432 irqs/sec  kernel:81.5%  exact: 100.0% lost: 0/0 drop: 0/0 [4000Hz cycles],  (all, 224 CPUs)
------------------------------------------------------------------------------

    49.62%  0.18%  test_sgx  [.] enclave_unclobbered_vdso_oversubscribed_remove
            |
            --14.59%--enclave_unclobbered_vdso_oversubscribed_remove
                      |
                      --20.19%--__vdso_sgx_enter_enclave
                                |
                                --11.23%--asm_exc_page_fault
                                          |
                                          --4.82%--exc_page_fault
                                                   |
                                                   --5.04%--do_user_addr_fault
                                                            |
                                                            --5.33%--handle_mm_fault
                                                                     |
                                                                     --4.98%--__handle_mm_fault
                                                                              |
                                                                              --4.59%--__do_fault
                                                                                       |
                                                                                       --5.71%--sgx_vma_fault
                                                                                                |
                                                                                                |--16.71%--sgx_encl_load_page
                                                                                                |          |
                                                                                                |          --17.05%--sgx_encl_eldu
                                                                                                |
                                                                                                --5.31%--__mutex_lock.isra.9
                                                                                                         |
                                                                                                         --4.92%--mutex_spin_on_owner

    49.46% 14.41%  [vdso]    [.] __vdso_sgx_enter_enclave
            |
            |--5.81%--__vdso_sgx_enter_enclave
            |         |
            |         --4.83%--asm_exc_page_fault
            |                  |
            |                  --4.82%--exc_page_fault
            |                           |
            |                           --5.04%--do_user_addr_fault
            |                                    |
            |                                    --5.33%--handle_mm_fault
            |                                             |
            |                                             --4.98%--__handle_mm_fault
            |                                                      |
            |                                                      --4.59%--__do_fault
            |                                                               |
            |                                                               --5.71%--sgx_vma_fault
            |                                                                        |
            |                                                                        |--16.71%--sgx_encl_load_page
            |                                                                        |          |
            |                                                                        |          --17.05%--sgx_encl_eldu
            |                                                                        |
            |                                                                        --5.31%--__mutex_lock.isra.9
            |                                                                                 |
            |                                                                                 --4.92%--mutex_spin_on_owner
            |
            --2.10%--0x5541f689495641d7
                     __libc_start_main
                     main
                     test_harness_run
                     __run_test
                     wrapper_enclave_unclobbered_vdso_oversubscribed_remove
                     |
                     --8.90%--enclave_unclobbered_vdso_oversubscribed_remove
                              |
                              --14.39%--__vdso_sgx_enter_enclave
                                        |
                                        --6.39%--asm_exc_page_fault

    45.60%  0.05%  [kernel]  [k] ksgxd
            |
            --10.27%--ksgxd
                      |
                      --13.34%--sgx_reclaim_pages
                                |
                                |--19.74%--sgx_encl_ewb
                                |          |
                                |          --18.49%--__sgx_encl_ewb
                                |
                                --15.73%--__mutex_lock.isra.9
                                          |
                                          --14.77%--mutex_spin_on_owner
exiting.
Reinette