On Fri, Apr 3, 2020 at 1:18 AM Prathu Baronia <prathu.baronia@xxxxxxxxxxx> wrote:
>
> THP allocation for anon memory requires zeroing of the huge page. To do so,
> we iterate over 2MB of memory in 4KB chunks. Each iteration calls kmap_atomic()
> and kunmap_atomic(). This routine makes sense where we need a temporary mapping
> of the user page. In !HIGHMEM cases, especially on 64-bit architectures, we don't
> need a temp mapping. Hence, kmap_atomic() acts as nothing more than multiple
> barrier() calls.
>
> This called for optimization. Simply getting the VADDR from the page does the
> job for us. So, implement another (optimized) routine for clear_huge_page()
> which doesn't need a temporary mapping of the user space page.
>
> While testing this patch on a Qualcomm SM8150 SoC (kernel v4.14.117), we see a
> 64% improvement in clear_huge_page().
>
> Ftrace results:
>
> Default profile:
> ------------------------------------------
>  6)   ! 473.802 us  |  clear_huge_page();
> ------------------------------------------
>
> With this patch applied:
> ------------------------------------------
>  5)   ! 170.156 us  |  clear_huge_page();
> ------------------------------------------

I suspect that if anything this is really pointing out how much
overhead is being added by process_huge_page. I know that on x86 most
modern processors can initialize memory at somewhere between 16B/cycle
and 32B/cycle, plus some fixed amount of overhead for making the
rep movsb/stosb call.

One thing that might make sense to look at would be whether we could
reduce the number of calls process_huge_page has to make by taking the
caches into account. For example, I know that on x86 the L1 cache is
32K for most processors, so we could look at bumping things up so that
we process 8 pages at a time and then make a call to cond_resched(),
instead of doing it per 4K page.

> Signed-off-by: Prathu Baronia <prathu.baronia@xxxxxxxxxxx>
> Reported-by: Chintan Pandya <chintan.pandya@xxxxxxxxxxx>
> ---
>  mm/memory.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3ee073d..3e120e8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5119,6 +5119,7 @@ EXPORT_SYMBOL(__might_fault);
>  #endif
>
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
> +#ifdef CONFIG_HIGHMEM
>  static void clear_gigantic_page(struct page *page,
>                                 unsigned long addr,
>                                 unsigned int pages_per_huge_page)
> @@ -5183,6 +5184,16 @@ void clear_huge_page(struct page *page,
>                                     addr + right_idx * PAGE_SIZE);
>         }
>  }
> +#else
> +void clear_huge_page(struct page *page,
> +                    unsigned long addr_hint, unsigned int pages_per_huge_page)
> +{
> +       void *addr;
> +
> +       addr = page_address(page);
> +       memset(addr, 0, pages_per_huge_page*PAGE_SIZE);
> +}
> +#endif

This seems like a very simplistic solution to the problem, and I am
worried something like this would introduce latency issues when
pages_per_huge_page gets to be large. It might make more sense to just
wrap the process_huge_page() call in the original clear_huge_page()
and add this code block as the #else case. That way you avoid
potentially stalling the system for extended periods of time if you
start trying to clear 1G pages with this function.

One interesting data point would be the cost of breaking this up into
a loop that only processes some fixed number of pages per iteration
and calls cond_resched() in between, so you can avoid introducing
latency spikes.
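
To be a bit more concrete about what I had in mind, the !HIGHMEM branch
could look something like the sketch below (completely untested, and
CLEAR_HUGE_CHUNK_PAGES is just a name/value I made up for illustration,
not something I have benchmarked):

#else	/* !CONFIG_HIGHMEM */
#define CLEAR_HUGE_CHUNK_PAGES	8	/* ~32K per memset, needs measuring */

void clear_huge_page(struct page *page,
		     unsigned long addr_hint, unsigned int pages_per_huge_page)
{
	void *addr = page_address(page);
	unsigned int i;

	for (i = 0; i < pages_per_huge_page; i += CLEAR_HUGE_CHUNK_PAGES) {
		unsigned int n = min_t(unsigned int, CLEAR_HUGE_CHUNK_PAGES,
				       pages_per_huge_page - i);

		/* Clear a cache-sized chunk, then give the scheduler a chance. */
		memset(addr + i * PAGE_SIZE, 0, n * PAGE_SIZE);
		cond_resched();
	}
}
#endif	/* CONFIG_HIGHMEM */

The chunk size could then be tuned so that each memset stays roughly
within the L1, which should keep most of the win you are seeing on the
SM8150 while keeping the worst-case scheduling latency bounded even for
gigantic pages.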