On Mon, Jun 6, 2022 at 1:22 PM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>
> This series introduces two optimizations in the huge page clearing path:
>
>  1. extends the clear_page() machinery to also handle extents larger
>     than a single page.
>  2. support non-cached page clearing for huge and gigantic pages.
>
> The first optimization is useful for hugepage fault handling, the
> second for prefaulting, or for gigantic pages.

Please just split these two issues up into entirely different patch
series.

That said, I have a few complaints about the individual patches even in
this form, to the point where I think the whole series is nasty:

 - get rid of 3/21 entirely. It's wrong in every possible way:

   (a) That shouldn't be an inline function in a header file at all. If
       you're clearing several pages of data, that just shouldn't be an
       inline function.

   (b) Get rid of __HAVE_ARCH_CLEAR_USER_PAGES. I hate how people make
       up those idiotic pointless names.

       If you have to use a #ifdef, just use the name of the function
       that the architecture overrides, not some other new name.

       But you don't need it at all, because

   (c) Just make a __weak function called clear_user_highpages() in
       mm/highmem.c, and allow architectures to just create their own
       non-weak ones.

 - patch 4/21 and 5/21: can we instead just get rid of that silly
   "process_huge_page()" thing entirely.

   It's disgusting, and it's a big part of why 'rep movs/stos' cannot
   work efficiently. It also makes NO SENSE if you then use
   non-temporal accesses.

   So instead of doubling down on the craziness of that function, just
   get rid of it entirely. There are two users, and they want to clear
   a hugepage and copy it respectively. Don't make it harder than it
   is.

   *Maybe* the code wants to do a "prefetch" afterwards. Who knows.
But I really think you should do the crapectomy first, make the code
simpler and more straightforward, and just allow architectures to
override the *simple* "copy or clear a large page" rather than keep
feeding this butt-ugly monstrosity.

 - 13/21: see 3/21.

 - 14-17/21: see 4/21 and 5/21. Once you do the crapectomy and get rid
   of the crazy process_huge_page() abstraction, and just let
   architectures do their own clear/copy huge pages, *all* this
   craziness goes away.

   Those "when to use which type of clear/copy" becomes a *local*
   question, no silly arch_clear_page_non_caching_threshold() garbage.

So I really don't like this series. A *lot* of it comes from that
horrible process_huge_page() model, and the whole model is just wrong
and pointless. You're literally trying to fix the mess that that
function is, but you're keeping the fundamental problem around.

The whole *point* of your patch-set is to use non-temporal stores,
which makes all the process_huge_page() things entirely pointless, and
only complicates things.

And even if we don't use non-temporal stores, that process_huge_page()
thing makes for trouble for any "rep stos/movs" implementation that
might actually do a better job if it was just chunked up in bigger
chunks.

Yes, yes, you probably still want to chunk that up somewhat due to
latency reasons, but even then architectures might as well just make
their own decisions, rather than have the core mm code make one
clearly bad decision for them. Maybe chunking it up in bigger chunks
than one page.

Maybe an architecture could do even more radical things like "let's
just 'rep stos' for the whole area, but set a special thread flag that
causes the interrupt return to break it up on return to kernel space".
IOW, the "latency fix" might not even be about chunking it up, it
might look more like our exception handling thing.
So I really think that crapectomy should be the first thing you do,
and that should be the first part of "extends the clear_page()
machinery to also handle extents larger than a single page"

               Linus