On Mon, Mar 15, 2021 at 09:35:03AM -0700, Dave Hansen wrote:
> On 3/13/21 8:01 AM, Jarkko Sakkinen wrote:
> > Background
> > ==========
> >
> > EPC section is covered by one or more SRAT entries that are associated with
> > one and only one PXM (NUMA node). The motivation behind this patch is to
> > provide basic elements of building allocation scheme based on this premise.
>
> A better set of background information here would also remind folks what
> an 'EPC section' is.

I don't actually know what exactly they are, because the SDM's definition is
not very rigorous ;-) I.e. it is not too precise about the relationship
between NUMA nodes and sections.

> > Just like normal RAM, enclave memory (EPC) should be covered by entries
> > in the ACPI SRAT table. These entries allow each EPC section to be
> > associated with a NUMA node.
> >
> > Use this information to implement a simple NUMA-aware allocator for
> > enclave memory.
>
> SGX enclave memory is enumerated by the processor in contiguous physical
> ranges called "EPC sections". Currently, there is a free list per
> section, but allocations simply target the lowest-numbered sections.
> This is functional, but has no NUMA awareness.
>
> Fortunately, EPC sections are covered by entries in the ACPI SRAT table.
> These entries allow each EPC section to be associated with a NUMA node,
> just like normal RAM.

Thanks!

> > Solution
> > ========
> >
> > Use phys_to_target_node() to associate each NUMA node with the EPC
> > sections contained within its range. In sgx_alloc_epc_page(), first try
> > to allocate from the NUMA node, where the CPU is executing. If that
> > fails, allocate from other nodes, iterating them from the current node
> > in order.
>
> To me, this is just telling us what the code does. It's not very
> useful. I'd say:
>
>     Implement a NUMA-aware enclave page allocator. Mirror the buddy
>     allocator and maintain a list of enclave pages for each NUMA node.
>     Attempt to allocate enclave memory first from local nodes, then fall
>     back to other nodes.
>
>     Note that the fallback is not as sophisticated as the buddy allocator
>     and is itself not aware of NUMA distances. When a node's free list is
>     empty, it searches for the next-highest node with enclave pages (and
>     will wrap if necessary). This could be improved in the future.

Thanks.

> > Other
> > =====
> >
> > NUMA_KEEP_MEMINFO dependency is required for phys_to_target_node().
>
> Reading the changelog, it's not obvious that you're talking about a
> Kconfig variable here.
>
> I was also thinking, this says:
>
>     # Keep arch NUMA mapping infrastructure post-init.
>     config NUMA_KEEP_MEMINFO
>         bool
>
> But we only need it during SGX init. Could you explain a bit why it's
> needed here specifically? Superficially it seems like we only need this
> info *during* init.

Well, numa_meminfo is a static variable in arch/x86/mm/numa.c, so we cannot
access it directly. And phys_to_target_node(), which accesses its data, does
not exist unless NUMA_KEEP_MEMINFO is defined.

So, yes, theoretically we could pull the data without NUMA_KEEP_MEMINFO, but
that would require making numa_meminfo a global variable.

We do not care about the post-init part, so in that sense you are right.
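As a side note, since the sgx.h hunk is not quoted here: the per-node
bookkeeping that the suggested changelog text describes boils down to roughly
this (a sketch based on the field names used in the main.c hunks below, not
necessarily the exact header change):

        /* Free EPC pages on one NUMA node, and the lock protecting the list. */
        struct sgx_numa_node {
                struct list_head free_page_list;
                spinlock_t lock;
        };

Each sgx_epc_section then carries a pointer to the sgx_numa_node it belongs
to, and the sgx_numa_nodes[] array is indexed by node id.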
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 513895af8ee7..3e6152a8dd2b 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -1930,6 +1930,7 @@ config X86_SGX
> >  	depends on CRYPTO_SHA256=y
> >  	select SRCU
> >  	select MMU_NOTIFIER
> > +	select NUMA_KEEP_MEMINFO if NUMA
> >  	help
> >  	  Intel(R) Software Guard eXtensions (SGX) is a set of CPU instructions
> >  	  that can be used by applications to set aside private regions of code
> > diff --git a/arch/x86/kernel/cpu/sgx/main.c b/arch/x86/kernel/cpu/sgx/main.c
> > index cb4561444b96..3b524a1361d6 100644
> > --- a/arch/x86/kernel/cpu/sgx/main.c
> > +++ b/arch/x86/kernel/cpu/sgx/main.c
> > @@ -18,14 +18,23 @@ static int sgx_nr_epc_sections;
> >  static struct task_struct *ksgxd_tsk;
> >  static DECLARE_WAIT_QUEUE_HEAD(ksgxd_waitq);
> >
> > -/*
> > - * These variables are part of the state of the reclaimer, and must be accessed
> > - * with sgx_reclaimer_lock acquired.
> > - */
> > +/* The reclaimer lock protected variables prepend the lock. */
> >  static LIST_HEAD(sgx_active_page_list);
> > -
> >  static DEFINE_SPINLOCK(sgx_reclaimer_lock);
> >
> > +/* The free page list lock protected variables prepend the lock. */
> > +static unsigned long sgx_nr_free_pages;
> > +
> > +/* Nodes with one or more EPC sections. */
> > +static nodemask_t sgx_numa_mask;
> > +
> > +/*
> > + * Array with one list_head for each possible NUMA node. Each
> > + * list contains all the sgx_epc_section's which are on that
> > + * node.
> > + */
> > +static struct sgx_numa_node *sgx_numa_nodes;
> > +
> >  /*
> >   * When the driver initialized, EPC pages go first here, as they could be
> >   * initialized to an active enclave, on kexec entry.
> > @@ -352,21 +361,9 @@ static void sgx_reclaim_pages(void)
> >  	}
> >  }
> >
> > -static unsigned long sgx_nr_free_pages(void)
> > -{
> > -	unsigned long cnt = 0;
> > -	int i;
> > -
> > -	for (i = 0; i < sgx_nr_epc_sections; i++)
> > -		cnt += sgx_epc_sections[i].free_cnt;
> > -
> > -	return cnt;
> > -}
> > -
> >  static bool sgx_should_reclaim(unsigned long watermark)
> >  {
> > -	return sgx_nr_free_pages() < watermark &&
> > -	       !list_empty(&sgx_active_page_list);
> > +	return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list);
> >  }
> >
> >  static int ksgxd(void *p)
> > @@ -443,50 +440,63 @@ static bool __init sgx_page_reclaimer_init(void)
> >  	return true;
> >  }
> >
> > -static struct sgx_epc_page *__sgx_alloc_epc_page_from_section(struct sgx_epc_section *section)
> > +static struct sgx_epc_page *__sgx_alloc_epc_page_from_node(int nid)
> >  {
> > -	struct sgx_epc_page *page;
> > +	struct sgx_numa_node *node = &sgx_numa_nodes[nid];
> > +	struct sgx_epc_page *page = NULL;
> > +
> > +	if (!node_isset(nid, sgx_numa_mask))
> > +		return NULL;
>
> I don't think this mask check is super necessary. It won't ever trigger
> for the "fallback" path. For the node-local allocation, I guess it
> saves a cacheline (node->free_page_list).
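The other reason for keeping the check is that sgx_numa_nodes[] entries for
nodes without any EPC section never get their lock and list initialized (see
the init code further down). If we wanted to drop the check from the
node-local path too, sgx_page_cache_init() could set up every possible node
up front, e.g. (hypothetical, untested sketch):

        /*
         * Hypothetical alternative: initialize all per-node lists so that
         * probing a node without EPC is harmless and the mask check can go.
         */
        for (nid = 0; nid < num_possible_nodes(); nid++) {
                spin_lock_init(&sgx_numa_nodes[nid].lock);
                INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
        }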
> > -	spin_lock(&section->lock);
> > +	spin_lock(&node->lock);
> >
> > -	if (list_empty(&section->page_list)) {
> > -		spin_unlock(&section->lock);
> > +	if (list_empty(&node->free_page_list)) {
> > +		spin_unlock(&node->lock);
> >  		return NULL;
> >  	}
> >
> > -	page = list_first_entry(&section->page_list, struct sgx_epc_page, list);
> > +	page = list_first_entry(&node->free_page_list, struct sgx_epc_page, list);
> >  	list_del_init(&page->list);
> > -	section->free_cnt--;
> > +	sgx_nr_free_pages--;
> > +
> > +	spin_unlock(&node->lock);
> >
> > -	spin_unlock(&section->lock);
> >  	return page;
> >  }
> >
> >  /**
> >   * __sgx_alloc_epc_page() - Allocate an EPC page
> >   *
> > - * Iterate through EPC sections and borrow a free EPC page to the caller. When a
> > - * page is no longer needed it must be released with sgx_free_epc_page().
> > + * Iterate through NUMA nodes and borrow a free EPC page to the caller. When a
>
> "borrow" is a weird word to use here.
>
> > + * page is no longer needed it must be released with sgx_free_epc_page(). Start
> > + * from the NUMA node, where the caller is executing.
>
> I don't think we need to tell folks how allocators work: that "free" and
> "alloc" are opposites.
>
> >   * Return:
> > - *   an EPC page,
> > - *   -errno on error
> > + * - an EPC page:  Free EPC pages were available.
> > + * - ERR_PTR(-ENOMEM):  Run out of EPC pages.
> >   */
>
> I think returning NULL from an allocator is probably OK. Folks
> understand that NULL means ENOMEM.
>
> >  struct sgx_epc_page *__sgx_alloc_epc_page(void)
> >  {
> > -	struct sgx_epc_section *section;
> >  	struct sgx_epc_page *page;
> > -	int i;
> > +	int nid = numa_node_id();
> >
> > -	for (i = 0; i < sgx_nr_epc_sections; i++) {
> > -		section = &sgx_epc_sections[i];
> > +	/* Try to allocate EPC from the current node, first: */
> > +	page = __sgx_alloc_epc_page_from_node(nid);
> > +	if (page)
> > +		return page;
> >
> > -		page = __sgx_alloc_epc_page_from_section(section);
> > +	/* Then, go through the other nodes: */
> > +	while (true) {
> > +		nid = next_node_in(nid, sgx_numa_mask);
> > +		if (nid == numa_node_id())
> > +			break;
>
> One nit: This can _theoretically_ livelock. Preemption is enabled here
> and numa_node_id() can change. It's theoretically possible that
> numa_node_id()!=nid forever.
>
> The way to prevent this is to read numa_node_id() once and then compare
> 'nid' against that single read.
>
> > +		page = __sgx_alloc_epc_page_from_node(nid);
> >  		if (page)
> > -			return page;
> > +			break;
> >  	}
> >
> > -	return ERR_PTR(-ENOMEM);
> > +	return page;
> >  }
>
> I guess you probably wrote it this way because you prefer it. But, this
> can be written with a single call to __sgx_alloc_epc_page_from_node().
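Something like this, then - read the node id once and use a single call site
(untested sketch; it returns NULL on failure as suggested above, and assumes
sgx_numa_mask is non-empty by the time the allocator can be called):

        struct sgx_epc_page *__sgx_alloc_epc_page(void)
        {
                struct sgx_epc_page *page;
                int start, nid;

                /* Snapshot the local node once so the loop has a stable end condition. */
                start = numa_node_id();

                /* If the local node has no EPC, start from the next node that does. */
                if (!node_isset(start, sgx_numa_mask))
                        start = next_node_in(start, sgx_numa_mask);

                nid = start;
                do {
                        page = __sgx_alloc_epc_page_from_node(nid);
                        if (page)
                                return page;

                        nid = next_node_in(nid, sgx_numa_mask);
                } while (nid != start);

                return NULL;
        }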
> >  /**
> > @@ -592,6 +602,7 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
> >  void sgx_free_epc_page(struct sgx_epc_page *page)
> >  {
> >  	struct sgx_epc_section *section = &sgx_epc_sections[page->section];
> > +	struct sgx_numa_node *node = section->node;
> >  	int ret;
> >
> >  	WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
> > @@ -600,10 +611,12 @@ void sgx_free_epc_page(struct sgx_epc_page *page)
> >  	if (WARN_ONCE(ret, "EREMOVE returned %d (0x%x)", ret, ret))
> >  		return;
> >
> > -	spin_lock(&section->lock);
> > -	list_add_tail(&page->list, &section->page_list);
> > -	section->free_cnt++;
> > -	spin_unlock(&section->lock);
> > +	spin_lock(&node->lock);
> > +
> > +	list_add_tail(&page->list, &node->free_page_list);
> > +	sgx_nr_free_pages++;
> > +
> > +	spin_unlock(&node->lock);
> >  }
> >
> >  static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> > @@ -624,8 +637,6 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> >  	}
> >
> >  	section->phys_addr = phys_addr;
> > -	spin_lock_init(&section->lock);
> > -	INIT_LIST_HEAD(&section->page_list);
> >
> >  	for (i = 0; i < nr_pages; i++) {
> >  		section->pages[i].section = index;
> > @@ -634,7 +645,7 @@ static bool __init sgx_setup_epc_section(u64 phys_addr, u64 size,
> >  		list_add_tail(&section->pages[i].list, &sgx_dirty_page_list);
> >  	}
> >
> > -	section->free_cnt = nr_pages;
> > +	sgx_nr_free_pages += nr_pages;
> >  	return true;
> >  }
> >
> > @@ -653,8 +664,11 @@ static bool __init sgx_page_cache_init(void)
> >  {
> >  	u32 eax, ebx, ecx, edx, type;
> >  	u64 pa, size;
> > +	int nid;
> >  	int i;
> >
> > +	sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes), GFP_KERNEL);
>
> Needs a NULL check.

Ugh, lol, I did spot this but forgot to fix it. Darn, well, I'll fix it for
the next version :-)

> >  	for (i = 0; i < ARRAY_SIZE(sgx_epc_sections); i++) {
> >  		cpuid_count(SGX_CPUID, i + SGX_CPUID_EPC, &eax, &ebx, &ecx, &edx);
> >
> > @@ -677,6 +691,21 @@ static bool __init sgx_page_cache_init(void)
> >  			break;
> >  		}
> >
> > +		nid = numa_map_to_online_node(phys_to_target_node(pa));
> > +		if (nid == NUMA_NO_NODE) {
> > +			/* The physical address is already printed above. */
> > +			pr_warn(FW_BUG "Unable to map EPC section to online node. Fallback to the NUMA node 0.\n");
> > +			nid = 0;
> > +		}
> > +
> > +		if (!node_isset(nid, sgx_numa_mask)) {
> > +			spin_lock_init(&sgx_numa_nodes[nid].lock);
> > +			INIT_LIST_HEAD(&sgx_numa_nodes[nid].free_page_list);
> > +			node_set(nid, sgx_numa_mask);
> > +		}
> > +
> > +		sgx_epc_sections[i].node = &sgx_numa_nodes[nid];
> > +
> >  		sgx_nr_epc_sections++;
> >  	}
>
> The rest looks OK.

/Jarkko
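P.S. The NULL check I have in mind for the next version is simply (sketch,
relying on sgx_page_cache_init() returning false on failure):

        sgx_numa_nodes = kmalloc_array(num_possible_nodes(), sizeof(*sgx_numa_nodes),
                                       GFP_KERNEL);
        if (!sgx_numa_nodes)
                return false;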