On 18/06/2024 12:53 am, Huang, Haitao wrote:
> From: Kristen Carlson Accardi <kristen@xxxxxxxxxxxxxxx>
>
> Currently in the EPC page allocation, the kernel simply fails the
> allocation when the current EPC cgroup fails to charge due to its usage
> reaching limit. This is not ideal. When that happens, a better way is
> to reclaim EPC page(s) from the current EPC cgroup (and/or its
> descendants) to reduce its usage so the new allocation can succeed.
>
> Add the basic building blocks to support per-cgroup reclamation.
>
> Currently the kernel only has one place to reclaim EPC pages: the global
> EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU
> list for each EPC cgroup, and introduce a "cgroup" variant function to
> reclaim EPC pages from a given EPC cgroup and its descendants.
>
> Currently the kernel does the global EPC reclaim in sgx_reclaim_pages().
> It always tries to reclaim EPC pages in a batch of SGX_NR_TO_SCAN (16)
> pages. Specifically, it always "scans", or "isolates", SGX_NR_TO_SCAN
> pages from the global LRU, and then tries to reclaim these pages at once
> for better performance.
>
> Implement the "cgroup" variant EPC reclaim in a similar way, but keep
> the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
> as input, and return the pages that are "scanned" and attempted for
> reclamation (but not necessarily reclaimed successfully); 2) loop over the
> given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
> until SGX_NR_TO_SCAN pages are "scanned".
>
> This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
> tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
> cgroup, and only moves to its descendants when there are not enough
> reclaimable EPC pages to "scan" in its LRU. It should be enough for
> most cases.

[...]

> In other cases, the caller may invoke this function in a
> loop to ensure enough pages reclaimed for its usage.
> To ensure all
> descendant groups are scanned in a round-robin fashion in those cases,
> sgx_cgroup_reclaim_pages() takes in a starting cgroup and returns the
> next cgroup that the caller can pass in as the new starting cgroup for a
> subsequent call.

AFAICT this part is new, and I believe this "round-robin" thing is just
for the "global reclaim"? Or is it also for per-cgroup reclaim where
more than SGX_NR_TO_SCAN pages need to be reclaimed?

I wish the changelog would just point out which consumers will use this
new sgx_cgroup_reclaim_pages(), like:

  The sgx_cgroup_reclaim_pages() will be used in three cases:

  1) direct/sync per-cgroup reclaim in try_charge()
  2) indirect/async per-cgroup reclaim triggered in try_charge()
  3) global reclaim

And then describe how they will use sgx_cgroup_reclaim_pages():

  Both 1) and 2) can result in needing to reclaim more than
  SGX_NR_TO_SCAN pages, in which case we should <fill in how to
  reclaim>.

  For 3), the new global reclaim should try to match the existing
  global reclaim behaviour, that is, to try to treat all EPC pages
  equally. <continue to explain how sgx_cgroup_reclaim_pages() can
  achieve this.>

With the above context, we can justify why sgx_cgroup_reclaim_pages()
is made in this form.

>
> Note, this simple implementation doesn't _exactly_ mimic the current
> global EPC reclaim (which always tries to do the actual reclaim in a
> batch of SGX_NR_TO_SCAN pages): when LRUs have fewer than SGX_NR_TO_SCAN
> reclaimable pages, the actual reclaim of EPC pages will be split into
> smaller batches _across_ multiple LRUs with each being smaller than
> SGX_NR_TO_SCAN pages.
>
> A more precise way to mimic the current global EPC reclaim would be to
> have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
> _across_ the given EPC cgroup _AND_ its descendants, and then do the
> actual reclaim in one batch. But this is unnecessarily complicated at
> this stage.
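To make the intended round-robin caller pattern concrete, here is a rough
userspace model (all names here, and the flat pre-order array standing in
for the cgroup tree, are mine, not from the patch): each call scans up to
SGX_NR_TO_SCAN pages and hands back the position to resume from, so a
caller needing more reclaim just passes it straight back in.

```c
#define SGX_NR_TO_SCAN 16

/* One "cgroup" LRU, modeled as just a count of reclaimable pages. */
struct model_cg {
	int reclaimable;
};

/*
 * Scan up to SGX_NR_TO_SCAN pages starting at index *start of the
 * pre-order list cgs[0..n-1] (cgs[0] plays the root).  Updates *start
 * to the next node to visit, so a caller needing more reclaim can pass
 * it back in -- the round-robin pattern the changelog describes.  The
 * real function instead returns NULL at the end of the walk and lets
 * the caller decide whether to restart from the root.
 */
static int model_reclaim_pages(struct model_cg *cgs, int n, int *start)
{
	int cnt = 0;
	int pos = *start;

	while (cnt < SGX_NR_TO_SCAN && pos < n) {
		int take = cgs[pos].reclaimable;

		if (take > SGX_NR_TO_SCAN - cnt)
			take = SGX_NR_TO_SCAN - cnt;
		cgs[pos].reclaimable -= take;
		cnt += take;
		if (cgs[pos].reclaimable == 0)
			pos++;	/* this LRU is drained, move on */
	}
	*start = (pos < n) ? pos : 0;	/* wrap: restart from the root */
	return cnt;
}
```

With three cgroups holding 10 pages each, the first call drains the root
and part of the first descendant; the second call resumes exactly where
the first stopped instead of hammering the root again.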
>
> Alternatively, the current sgx_reclaim_pages() could be changed to
> return the actual "reclaimed" pages, but not "scanned" pages. However,
> the reclamation is a lengthy process; forcing a successful reclamation
> of a predetermined number of pages may block the caller for too long. And
> that may not be acceptable in some synchronous contexts, e.g., in
> serving an ioctl().
>
> With this building block in place, add synchronous reclamation support
> in sgx_cgroup_try_charge(): trigger a call to
> sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
> caller allows synchronous reclaim as indicated by a newly added
> parameter.
>
> A later patch will add support for asynchronous reclamation reusing
> sgx_cgroup_reclaim_pages().

It seems you should also mention that the new global reclaim will also
use this sgx_cgroup_reclaim_pages()?

[...]

> +/**
> + * sgx_cgroup_reclaim_pages() - reclaim EPC from a cgroup tree
> + * @root:	The root of cgroup tree to reclaim from.
> + * @start:	The descendant cgroup from which to start the tree walking.
> + *
> + * This function performs a pre-order walk in the cgroup tree under the given
> + * root, starting from the node %start, or from the root if %start is NULL. The
> + * function will attempt to reclaim pages at each node until a fixed number of
> + * pages (%SGX_NR_TO_SCAN) are attempted for reclamation. No guarantee of
> + * success on the actual reclamation process. In extreme cases, if all pages in
> + * front of the LRUs are recently accessed, i.e., considered "too young" to
> + * reclaim, no page will actually be reclaimed after walking the whole tree.
> + *
> + * In some cases, a caller may want to ensure enough reclamation until its
> + * specific need is met. In those cases, the caller should invoke this function
> + * in a loop, and at each iteration passes in the same root and the next node
> + * returned from the previous call as the new %start.
> + *
> + * Return: The next misc cgroup in the subtree to continue the scanning and
> + * attempt for more reclamation from this subtree if needed.
[...]
> Caller must
> + * release the reference if the returned is not used as %start for a subsequent
> + * call.

This sentence isn't clear to me. First of all, release the reference
"of what"? The %start, or the one returned by this function? And is it
because of ...

> + */
> +static struct misc_cg *sgx_cgroup_reclaim_pages(struct misc_cg *root, struct misc_cg *start)
> +{
> +	struct cgroup_subsys_state *css_root, *pos;
> +	struct cgroup_subsys_state *next = NULL;
> +	struct sgx_cgroup *sgx_cg;
> +	unsigned int cnt = 0;
> +
> +	/* Caller must ensure css_root and start ref's acquired */

... the caller must acquire the refs of both @css_root and @start,
and ...

> +	css_root = &root->css;
> +	if (start)
> +		pos = &start->css;
> +	else
> +		pos = css_root;
> +
> +	while (cnt < SGX_NR_TO_SCAN) {
> +		sgx_cg = sgx_cgroup_from_misc_cg(css_misc(pos));
> +		cnt += sgx_reclaim_pages(&sgx_cg->lru);
> +
> +		rcu_read_lock();
> +
> +		next = css_next_descendant_pre(pos, css_root);
> +
> +		if (pos != css_root)
> +			css_put(pos);

... the ref is decreased internally?

> +
> +		if (!next || !css_tryget(next)) {
> +			/* We are done if next is NULL or not safe to continue
> +			 * the walk if next is dead. Return NULL and the caller
> +			 * determines whether to restart from root.
> +			 */

Incorrect comment style -- the opening /* of a multi-line comment
should be on its own line.

> +			rcu_read_unlock();
> +			return NULL;
> +		}
> +
> +		rcu_read_unlock();
> +		pos = next;

There's no ref grab here, wouldn't the above ...

	if (pos != css_root)
		css_put(pos);

... decrease the ref w/o it having been increased?

> +	}
> +
> +	return css_misc(next);

Here AFAICT the ref isn't increased, but ...

[...]

> +/**
> + * sgx_cgroup_try_charge() - try to charge cgroup for a single EPC page
>  * @sgx_cg:	The EPC cgroup to be charged for the page.
> + * @reclaim:	Whether or not synchronous EPC reclaim is allowed.
>  * Return:
>  * * %0 - If successfully charged.
>  * * -errno - for failures.
>  */
> -int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
> +int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim)
>  {
> -	return misc_cg_try_charge(MISC_CG_RES_SGX_EPC, sgx_cg->cg, PAGE_SIZE);
> +	int ret;
> +	struct misc_cg *cg_next = NULL;
> +
> +	for (;;) {
> +		ret = __sgx_cgroup_try_charge(sgx_cg);
> +
> +		if (ret != -EBUSY)
> +			goto out;
> +
> +		if (reclaim == SGX_NO_RECLAIM) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next);
> +		cond_resched();
> +	}
> +
> +out:
> +	if (cg_next != sgx_cg->cg)
> +		put_misc_cg(cg_next);

... if I am reading correctly, here you do the put anyway.

> +	return ret;
>  }
>

And when there are more than SGX_NR_TO_SCAN pages that need to be
reclaimed, the above ...

	for (;;) {
		cg_next = sgx_cgroup_reclaim_pages(sgx_cg->cg, cg_next);
	}

... actually tries to reclaim those pages from @sgx_cg _AND_ its
descendants, and tries to do it _EQUALLY_.

Is this desired, or should we always try to reclaim from @sgx_cg first,
and only move to the descendants when @sgx_cg shouldn't be reclaimed
from anymore?

Anyway, it's different from the previous behaviour.

[...]
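For what it's worth, the retry loop itself can be modeled in userspace
like this (MODEL_LIMIT, the batch size of 4, and all names are mine,
purely illustrative): charge one page, and on failure reclaim a batch
and retry, bailing out when reclaim is not allowed or nothing is left
to reclaim.

```c
#define MODEL_LIMIT 20	/* stand-in for the cgroup's page limit */

/*
 * Toy model of the sgx_cgroup_try_charge() retry loop: *usage counts
 * charged pages, *reclaimable stands in for the pages sitting on the
 * cgroup tree's LRUs.  Returns 0 on success, -1 (think -ENOMEM) on
 * failure.
 */
static int model_try_charge(int *usage, int *reclaimable, int allow_reclaim)
{
	for (;;) {
		if (*usage < MODEL_LIMIT) {	/* __sgx_cgroup_try_charge() */
			(*usage)++;
			return 0;
		}

		if (!allow_reclaim)		/* SGX_NO_RECLAIM */
			return -1;

		if (*reclaimable == 0)		/* nothing left anywhere */
			return -1;

		/* sgx_cgroup_reclaim_pages(): free up to a small batch */
		{
			int batch = *reclaimable < 4 ? *reclaimable : 4;

			*usage -= batch;
			*reclaimable -= batch;
		}
	}
}
```

Note the model, like the patch, makes no promise about *where* in the
tree the reclaimed batch comes from -- which is exactly the "equally
vs. @sgx_cg first" question above.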
> -static bool sgx_should_reclaim(unsigned long watermark)
> +static bool sgx_should_reclaim_global(unsigned long watermark)
>  {
>  	return atomic_long_read(&sgx_nr_free_pages) < watermark &&
>  	       !list_empty(&sgx_global_lru.reclaimable);
>  }
>
> +static void sgx_reclaim_pages_global(void)
> +{
> +	sgx_reclaim_pages(&sgx_global_lru);
> +}
> +
>  /*
>   * sgx_reclaim_direct() should be called (without enclave's mutex held)
>   * in locations where SGX memory resources might be low and might be
> @@ -394,8 +405,8 @@ static bool sgx_should_reclaim(unsigned long watermark)
>   */
>  void sgx_reclaim_direct(void)
>  {
> -	if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
> -		sgx_reclaim_pages_global();
> +	if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
> +		sgx_reclaim_pages_global();
>  }
>

I wish the rename was mentioned in the changelog too.