Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list

Yang Shi <shy828301@xxxxxxxxx> · Mon, 15 Apr 2024 14:16:15 -0700

On Mon, Apr 15, 2024 at 12:19 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> >>
> >> We could have
> >> * THP_DEFERRED_SPLIT_PAGE
> >> * THP_UNDO_DEFERRED_SPLIT_PAGE
> >> * THP_PERFORM_DEFERRED_SPLIT_PAGE
> >>
> >> Maybe that would catch more cases (not sure if all, though). Then, you
> >> could tell how many are still on that list. THP_DEFERRED_SPLIT_PAGE -
> >> THP_UNDO_DEFERRED_SPLIT_PAGE - THP_PERFORM_DEFERRED_SPLIT_PAGE.
> >>
> >> That could give one a clearer picture how deferred split interacts with
> >> actual splitting (possibly under memory pressure), the whole reason why
> >> deferred splitting was added after all.
> >
> > I'm not quite sure whether there is a solid usecase or not. If we
> > have, we could consider this. But a simpler counter may be more
> > preferred.
>
> Yes.
>
> >
> >>
> >>> It may be useful. However the counter is typically used to estimate
> >>> how many THP are partially unmapped during a period of time.
> >>
> >> I'd say it's a bit of an abuse of that counter; well, or interpreting
> >> something into the counter that that counter never reliably represented.
> >
> > It was way more reliable than now.
>
> Correct me if I am wrong: now that we only adjust the counter for
> PMD-sized THP, it is as (un)reliable as it always was.

Yes. The problem introduced by mTHP was somehow workaround'ed by that commit.

>
> Or was there another unintended change by some of my cleanups or
> previous patches?

No, at least I didn't see for now.

>
> >
> >>
> >> I can easily write a program that keeps sending your counter to infinity
> >> simply by triggering that behavior in a loop, so it's all a bit shaky.
> >
> > I don't doubt that. But let's get back to reality. The counter used to
> > stay reasonable and reliable with most real life workloads before
> > mTHP. There may be over-counting, for example, when unmapping a
> > PTE-mapped THP which was not on a deferred split queue before. But
> > such a case is not common for real life workloads because the huge PMD
> > has to be split by partial unmap for most cases. And the partial unmap
> > will add the THP to deferred split queue.
> >
> > But now a common workload, for example, just process exit, may
> > probably send the counter to infinity.
>
> Agreed, that's stupid.
>
> >
> >>
> >> Something like Ryans script makes more sense, where you get a clearer
> >> picture of what's mapped where and how. Because that information can be
> >> much more valuable than just knowing if it's mapped fully or partially
> >> (again, relevant for handling with memory waste).
> >
> > Ryan's script is very helpful. But the counter has been existing and
> > used for years, and it is a quick indicator and much easier to monitor
> > in a large-scale fleet.
> >
> > If we think the reliability of the counter is not worth fixing, why
> > don't we just remove it. No counter is better than a broken counter.
>
> Again, is only counting the PMD-sized THPs "fixing" the old use cases?

Yes

> Then it should just stick around. And we can even optimize it for some
> more cases as proposed in this patch. But there is no easy way to "get
> it completely right" I'm afraid.

I don't mean we should revert that "fixing", my point is we should not
rely on it and we should make rmap remove code behave more reliable
regardless of whether we just count PMD-sized THP or not.

>
> --
> Cheers,
>
> David / dhildenb
>