Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@xxxxxxxxxx> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).

Agreed. Best way to iron out all the kinks too.

> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.

Looking at it now, it seems to be caused by the wide range of
MAX_ZONEORDER values on powerpc? It made my head hurt so I didn't try
to figure it out in detail.

But, even on x86, AFAICT gigantic pages could straddle
MAX_SECTION_BITS? An arch-specific clear_huge_page() could, however,
handle 1GB pages via some kind of static loop around
(30 - MAX_SECTION_BITS).

I'm a little fuzzy on the CONFIG_SPARSEMEM_EXTREME and
!SPARSEMEM_VMEMMAP configs. But I think we should be able to avoid the
pfn_to_page()/page_to_pfn() lookups entirely, or at least keep them
out of the inner loop.

> It most definitely makes no sense when there is no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).

Yeah, I'm hoping to do exactly that.

> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.

Ack that.

>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and deferring writes to the last few cachelines
>> until just before returning to user-space.
>>
>> Can you elaborate a little about what you are describing above?
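(Fwiw, the rough shape of what I had been imagining -- this is only a
sketch, and TIF_CLEAR_TAIL, clear_pages_except(), and the task_struct
field are all made-up names:)

	void clear_huge_page(struct page *page, unsigned long addr_hint,
			     unsigned int pages_per_huge_page)
	{
		void *addr = page_address(page);
		void *tail = (void *)(addr_hint & PAGE_MASK);

		/* Clear everything except the page with the faulting address. */
		clear_pages_except(addr, pages_per_huge_page, tail);

		/*
		 * Defer the rest: exit-to-user-mode sees TIF_CLEAR_TAIL,
		 * clears current->clear_tail last, and returns to user
		 * space with those cachelines still hot.
		 */
		current->clear_tail = tail;
		set_thread_flag(TIF_CLEAR_TAIL);
	}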
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
>  (a) that page chunking for highmem accesses
>
>  (b) the page access _ordering_ for the cache hinting reasons
>
>  (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
>  (a) HIGHMEM doesn't exist on sane architectures that we care about
>      these days
>
>  (b) the cache hinting ordering makes no sense if you do non-temporal
>      accesses (and might then be replaced by a possible "prefetch" at
>      the end)
>
>  (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in between
> chunks.

Thanks for the detail. That helps.

> Now, there are alternatives here:
>
>  (a) only override that existing disgusting (but tested) function when
>      both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
>  (b) do something like this:
>
> 	void clear_huge_page(struct page *page,
> 			     unsigned long addr_hint,
> 			     unsigned int pages_per_huge_page)
> 	{
> 		void *addr = page_address(page);
> 	#ifdef CONFIG_PREEMPT_NONE
> 		for (int i = 0; i < pages_per_huge_page; i++) {
> 			clear_page(addr + i * PAGE_SIZE);
> 			cond_resched();
> 		}
> 	#else
> 		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
> 		prefetch(addr_hint);
> 	#endif
> 	}

We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY as
well, right?

Either way, as you said earlier, we could chunk this up in bigger units
than a single page. (In the numbers I had posted earlier, chunking in
units of up to 1MB gave ~25% higher clearing BW. I don't think the
microcode setup costs are that high, but I don't have a good
explanation for the difference.)

> or (c), do that "more radical approach", where you do something like this:
>
> 	void clear_huge_page(struct page *page,
> 			     unsigned long addr_hint,
> 			     unsigned int pages_per_huge_page)
> 	{
> 		void *addr = page_address(page);
>
> 		set_thread_flag(TIF_PREEMPT_ME);
> 		nontemporal_clear_big_area(addr, PAGE_SIZE*pages_per_huge_page);
> 		clear_thread_flag(TIF_PREEMPT_ME);
> 		prefetch(addr_hint);
> 	}
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.

I like this one. I'll try out (b) and (c) and see how the code shakes
out.

Just one minor point -- seems to me that the choice of nontemporal or
temporal accesses might have to be based on a hint to
clear_huge_page(): the nontemporal path is only faster when
(pages_per_huge_page * PAGE_SIZE) exceeds the LLC size. So in the
page-fault path it might make sense to use the temporal path (except
for gigantic pages), while in the prefault path nontemporal is likely
better.
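(Roughly along these lines -- a sketch only: the hint argument,
llc_size(), and temporal_clear_big_area() are names I'm making up here,
with nontemporal_clear_big_area() and TIF_PREEMPT_ME borrowed from your
(c) above:)

	enum clear_page_hint { CLEAR_HINT_FAULT, CLEAR_HINT_PREFAULT };

	void clear_huge_page(struct page *page, unsigned long addr_hint,
			     unsigned int pages_per_huge_page,
			     enum clear_page_hint hint)
	{
		void *addr = page_address(page);
		unsigned long len = PAGE_SIZE * pages_per_huge_page;

		set_thread_flag(TIF_PREEMPT_ME);
		if (hint == CLEAR_HINT_PREFAULT || len > llc_size())
			/* won't be touched soon, or won't fit in the LLC */
			nontemporal_clear_big_area(addr, len);
		else
			/* page-fault path: keep the lines cache-hot */
			temporal_clear_big_area(addr, len);
		clear_thread_flag(TIF_PREEMPT_ME);

		/* warm the faulting cachelines for the return to user */
		prefetch((void *)addr_hint);
	}

Thanks

--
ankur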