On Thu, Oct 6, 2022 at 5:07 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> On Thu, Oct 6, 2022 at 2:57 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > On Thu, Oct 6, 2022 at 12:30 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > On Thu, Oct 6, 2022 at 8:32 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > On Thu, Oct 06, 2022 at 12:30:45AM -0700, Yosry Ahmed wrote:
> > > > > On Wed, Oct 5, 2022 at 9:19 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > > > > > On Wed, Oct 05, 2022 at 03:13:38PM -0600, Yu Zhao wrote:
> > > > > > > On Wed, Oct 5, 2022 at 3:02 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > > > > On Wed, Oct 5, 2022 at 1:48 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> > > > > > > > > On Wed, Oct 5, 2022 at 11:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > > > > > >
> > > > > > > > > > During page/folio reclaim, we check whether a folio is referenced using folio_referenced() to avoid reclaiming folios that have been recently accessed (hot memory). The rationale is that this memory is likely to be accessed again soon, and hence reclaiming it would cause a refault.
> > > > > > > > > >
> > > > > > > > > > For memcg reclaim, we currently only check accesses to the folio from processes in the subtree of the target memcg. This behavior was originally introduced by commit bed7161a519a ("Memory controller: make page_referenced() cgroup aware") a long time ago. Back then, refaulted pages would get charged to the memcg of the process that was faulting them in, so it made sense to only consider accesses coming from processes in the subtree of target_mem_cgroup: if a page was charged to memcg A but only being accessed by a sibling memcg B, we would reclaim it when memcg A is the reclaim target, and memcg B could then fault it back in and get charged for it appropriately.
> > > > > > > > > >
> > > > > > > > > > Today, this behavior still makes sense for file pages. However, unlike file pages, swapbacked pages that are refaulted get charged back to the memcg they were charged to when they were swapped out. This means that if a swapbacked page is charged to memcg A but only used by memcg B, and we reclaim it from memcg A, it is simply faulted back in and charged to memcg A again once memcg B accesses it. In that sense, accesses from all memcgs matter equally when deciding whether a swapbacked page/folio is a viable reclaim target.
> > > > > > > > > >
> > > > > > > > > > Modify folio_referenced() to always consider accesses from all memcgs if the folio is swapbacked.
> > > > > > > > >
> > > > > > > > > It seems to me this change can potentially increase the number of zombie memcgs. Any risk assessment done on this?
> > > > > > > >
> > > > > > > > Do you mind elaborating on the case(s) where this could happen? Is this the cgroup v1 case in mem_cgroup_swapout() where we are reclaiming from a zombie memcg and swapping out would let us move the charge to the parent?
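
For context, the change being discussed amounts to roughly the hunk below. This is only a sketch against the current folio_referenced() in mm/rmap.c, assuming the memcg filter is still installed via invalid_folio_referenced_vma; it is not the actual diff:

	/*
	 * If we are reclaiming on behalf of a cgroup, skip counting
	 * references from other cgroups, but only for file folios.
	 * A swapbacked folio is charged back to the same memcg on
	 * refault, so references from all memcgs are relevant to it.
	 */
	if (memcg && !folio_test_swapbacked(folio))
		rwc.invalid_vma = invalid_folio_referenced_vma;
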
> > > > > > > The scenario is quite straightforward: for a page charged to memcg A and also actively used by memcg B, if we don't ignore the access from memcg B, we won't be able to reclaim it after memcg A is deleted.
> > > > > >
> > > > > > This patch changes the behavior of limit-induced reclaim. There is no limit reclaim on A after it's been deleted. And parental/global reclaim has always recognized outside references.
> > > > >
> > > > > Do you mind elaborating on the parental reclaim part?
> > > > >
> > > > > I am looking at the code, and it looks like memcg reclaim of a parent (limit-induced or proactive) will only consider references coming from its subtree, even when reclaiming from its dead children. It looks like as long as sc->target_mem_cgroup is set, we ignore outside references (relative to sc->target_mem_cgroup).
> > > >
> > > > Yes, I was referring to outside of A.
> > > >
> > > > As of today, any siblings of A can already pin its memory after it's dead. I suppose your patch would add cousins to that list. It doesn't seem like a categorical difference to me.
> > > > >
> > > > > If that is true, maybe we want to keep ignoring outside references for swap-backed pages if the folio is charged to a dead memcg? My understanding is that in this case we will uncharge the page from the dead memcg and charge the swapped entry to the parent, reducing the number of refs on the dead memcg. Without this check, this patch might prevent the charge from being moved to the parent in this case. WDYT?
> > > >
> > > > I don't think it's worth it. Keeping the semantics simple and the behavior predictable is IMO more valuable.
> > > >
> > > > It also wouldn't fix the scrape-before-rmdir issue Yu points out, which I think is the more practical concern. In light of that, it might be best to table the patch for now. (Until we have reparent-on-delete for anon and file pages...)
> > >
> > > If we add a mem_cgroup_online() check, we partially solve the problem. Maybe scrape-before-rmdir will not reclaim those pages at once, but the next time we try to reclaim from the dead memcg (global, limit, proactive, ...) we will reclaim the pages. So we will only be delaying the freeing of those zombie memcgs.
> >
> > As an observer, this seems to be the death by a thousand cuts of the existing mechanism that Google has been using to virtually eliminate zombie memcgs for the last decade.
> >
> > I understand the desire to fix a specific problem with this patch. But it's methodologically wrong to focus on specific problems without considering the bigger picture and how it's evolving.
> >
> > Our internal memory.reclaim, which is being superseded, is a superset of the mainline version. It has two flags relevant to this discussion:
> > 1. hierarchical walk of a parent
> > 2. target dead memcgs only
> > With these, our job scheduler (Borg) doesn't need to scrape before rmdir at all. It does something called "applying root pressure", which, as one might imagine, is to write to the root memory.reclaim with the above flags. We have metrics on the efficiency of this mechanism and they are closely monitored.
> >
> > Why is this important? Because Murphy's law is generally true for a fleet once its scale and diversity are large enough.
> > *We used to run out of memcg IDs.* And we are still carrying a patch that enlarges swap_cgroup->id from unsigned short to unsigned int.
> >
> > Compared with the recharging proposal we have been discussing, the two cases that the above solution can't help with are:
> > 1. kernel long pins
> > 2. foreign mlocks
> > But it's still *a lot* more reliable than the scrape-before-rmdir approach (or scrape-after-rmdir if we can hold the FD open before rmdir), because it offers unlimited retries and ensures that no dead memcgs, e.g., those created and deleted by jobs (not the job scheduler), can escape.
> >
> > Unless you can provide data, my past experience tells me that this patch will make scrape-before-rmdir unacceptable (in terms of effectiveness) for our fleet. Of course you can add additional code, i.e., those two flags or the offline check, which I don't object to.
>
> I agree that the zombie memcgs problem is a serious one that needs to be dealt with, and recharging memory when a memcg is dying seems like a promising direction. However, this patch's goal is to improve reclaim of shared swapbacked memory in general, regardless of the zombie memcgs problem. I understand that the current version affects the zombie memcgs problem, but I believe this is an oversight that needs to be fixed, not something that should make us leave the reclaim problem unsolved.
>
> I think this patch + an offline check should be sufficient to fix the reclaim problem without regressing the zombie memcgs problem, for multiple reasons; see below.
>
> If we implement recharging the memory of dying memcgs in the future, we can always come back and remove the offline check.
> >
> > Frankly, let me ask the real question: are you really sure this is the best for us and the rest of the community?
>
> Yes. This patch should improve reclaim of shared memory, as I elaborated in my previous email, and with an offline check I believe we wouldn't be regressing the zombie memcgs problem, whether for Google or for the community, for the following reasons.

But it will regress a (non-root) parent that wishes to keep to itself the hot memory it shared with its deleted children. I think we should sleep on this.
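
For concreteness, if we did go this route, I imagine the offline check being discussed would amount to something like the sketch below, again against folio_referenced() in mm/rmap.c. This is only an illustration of the idea, not a tested patch; the placement, and the assumption that the folio is charged whenever memcg reclaim passes a non-NULL memcg here, are mine:

	/*
	 * Sketch only: keep the subtree-only filter for file folios, and
	 * also for swapbacked folios charged to an offline memcg, so that
	 * reclaiming the latter can still move their charges to the
	 * parent. For swapbacked folios charged to an online memcg, count
	 * references from all memcgs, since a refault would be charged
	 * back to the same memcg anyway.
	 */
	if (memcg && (!folio_test_swapbacked(folio) ||
		      !mem_cgroup_online(folio_memcg(folio))))
		rwc.invalid_vma = invalid_folio_referenced_vma;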