Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle
MAX_SECTION_BITS? An arch-specific clear_huge_page() could, however,
handle 1GB pages via some kind of static loop around
(30 - MAX_SECTION_BITS).

I'm a little fuzzy on the CONFIG_SPARSEMEM_EXTREME and
!SPARSEMEM_VMEMMAP configs. But I think we should be able to avoid the
pfn_to_page()/page_to_pfn() lookups entirely, or at least keep them
out of the inner loop.

> It most definitely makes no sense when there is no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and deferring writes to the last few cachelines
>> until just before returning to user-space.
>>
>> Can you elaborate a little about what you are describing above?
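(Fwiw, the rough shape of what I had been imagining -- this is only a
sketch, and TIF_CLEAR_TAIL, clear_pages_except(), and the task_struct
field are all made-up names:)

	void clear_huge_page(struct page *page, unsigned long addr_hint,
			     unsigned int pages_per_huge_page)
	{
		void *addr = page_address(page);
		void *tail = (void *)(addr_hint & PAGE_MASK);

		/* Clear everything except the page with the faulting address. */
		clear_pages_except(addr, pages_per_huge_page, tail);

		/*
		 * Defer the rest: exit-to-user-mode sees TIF_CLEAR_TAIL,
		 * clears current->clear_tail last, and returns to user
		 * space with those cachelines still hot.
		 */
		current->clear_tail = tail;
		set_thread_flag(TIF_CLEAR_TAIL);
	}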
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
>  (a) that page chunking for highmem accesses
>
>  (b) the page access _ordering_ for the cache hinting reasons
>
>  (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
>  (a) HIGHMEM doesn't exist on sane architectures that we care about
>      these days
>
>  (b) the cache hinting ordering makes no sense if you do non-temporal
>      accesses (and might then be replaced by a possible "prefetch" at
>      the end)
>
>  (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
>  (a) only override that existing disgusting (but tested) function when
>      both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
>  (b) do something like this:
>
> 	void clear_huge_page(struct page *page,
> 			     unsigned long addr_hint,
> 			     unsigned int pages_per_huge_page)
> 	{
> 		void *addr = page_address(page);
> 	#ifdef CONFIG_PREEMPT_NONE
> 		for (int i = 0; i < pages_per_huge_page; i++) {
> 			clear_page(addr + i * PAGE_SIZE);
> 			cond_resched();
> 		}
> 	#else
> 		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
> 		prefetch(addr_hint);
> 	#endif
> 	}

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY as
well, right?

Either way, as you said earlier, we could chunk this up in bigger units
than a single page. (In the numbers I had posted earlier, chunking in
units of up to 1MB gave ~25% higher clearing BW. I don't think the
microcode setup costs are that high, but I don't have a good
explanation for the difference.)

> or (c), do that "more radical approach", where you do something like this:
>
> 	void clear_huge_page(struct page *page,
> 			     unsigned long addr_hint,
> 			     unsigned int pages_per_huge_page)
> 	{
> 		void *addr = page_address(page);
>
> 		set_thread_flag(TIF_PREEMPT_ME);
> 		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
> 		clear_thread_flag(TIF_PREEMPT_ME);
> 		prefetch(addr_hint);
> 	}
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal accesses might have to be based on a hint to
clear_huge_page(): the nontemporal path is only faster when
(pages_per_huge_page * PAGE_SIZE) exceeds the LLC size. So in the
page-fault path it might make sense to use the temporal path (except
for gigantic pages), while in the prefault path nontemporal is likely
better.
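(Roughly along these lines -- a sketch only: the hint argument,
llc_size(), and temporal_clear_big_area() are names I'm making up here,
with nontemporal_clear_big_area() and TIF_PREEMPT_ME borrowed from your
(c) above:)

	enum clear_page_hint { CLEAR_HINT_FAULT, CLEAR_HINT_PREFAULT };

	void clear_huge_page(struct page *page, unsigned long addr_hint,
			     unsigned int pages_per_huge_page,
			     enum clear_page_hint hint)
	{
		void *addr = page_address(page);
		unsigned long len = PAGE_SIZE * pages_per_huge_page;

		set_thread_flag(TIF_PREEMPT_ME);
		if (hint == CLEAR_HINT_PREFAULT || len > llc_size())
			/* won't be touched soon, or won't fit in the LLC */
			nontemporal_clear_big_area(addr, len);
		else
			/* page-fault path: keep the lines cache-hot */
			temporal_clear_big_area(addr, len);
		clear_thread_flag(TIF_PREEMPT_ME);

		/* warm the faulting cachelines for the return to user */
		prefetch((void *)addr_hint);
	}

Thanks

--
ankur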