Following your response, I tried to find out the real benefit of removing
the effective barrier() calls. To find that out, I wrote a simple diff
(exp-v2), shown below, on top of the base code:

-------------------------------------------------------
 include/linux/highmem.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index b471a88..df908b4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -145,9 +145,8 @@ do { \
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
-	void *addr = kmap_atomic(page);
+	void *addr = page_address(page);
 	clear_user_page(addr, vaddr, page);
-	kunmap_atomic(addr);
 }
 #endif
-------------------------------------------------------

For consistency, I kept the CPU, DDR and cache on the performance
governor. The target used is Qualcomm's SM8150 with kernel 4.14.117. On
this platform, CPU0 is a Cortex-A55 and CPU6 is a Cortex-A76.

The profiling results for clear_huge_page() are as follows:

-------------------------------------------------------
Ftrace results: times are in microseconds.
-------------------------------------------------------
- Base:
  - CPU0:
    - Sample size : 95
    - Mean        : 237.383
    - Std dev     : 31.288
  - CPU6:
    - Sample size : 61
    - Mean        : 258.065
    - Std dev     : 19.97
-------------------------------------------------------
- v1 (original submission):
  - CPU0:
    - Sample size : 80
    - Mean        : 112.298
    - Std dev     : 0.36
  - CPU6:
    - Sample size : 83
    - Mean        : 71.238
    - Std dev     : 13.7819
-------------------------------------------------------
- exp-v2 (experimental diff mentioned above):
  - CPU0:
    - Sample size : 69
    - Mean        : 218.911
    - Std dev     : 54.306
  - CPU6:
    - Sample size : 101
    - Mean        : 241.522
    - Std dev     : 19.3068
-------------------------------------------------------

- Comparing base vs exp-v2: simply removing the barriers from the
  kmap_atomic() code doesn't improve the results significantly.

- Comparing v1 vs exp-v2: a straight memset(0) of the 2MB page is
  significantly faster than zeroing the individual sub-pages.

- Analysing base and exp-v2: CPU6 was expected to outperform CPU0, but
  the zeroing pattern is adversarial for CPU6, so it ends up performing
  worse. Under a sequential load, by contrast, CPU6 truly outperforms
  CPU0.

Based on the above three points, it looks like a straight memset(0)
indeed improves execution time, primarily because its predictable access
pattern suits most CPU architectures out there. (A small user-space
sketch that illustrates this contrast is appended at the end of this
mail.)

Having said that, I also understand that v1 loses out on the
optimization made by c79b57e462b5 ("mm: hugetlb: clear target sub-page
last when clearing huge page"), which keeps the caches hot around the
faulting address. If keeping the caches hot around the faulting address
is that important (only numbers can prove it, and I don't have the
insight to get those numbers), it might be better to develop on top of
v1 than to not use v1 at all.

The 04/03/2020 10:52, Michal Hocko wrote:
>
> This is an old kernel. Do you see the same with the current upstream
> kernel? Btw. 60% improvement only from dropping barrier sounds
> unexpected to me. Are you sure this is the only reason? c79b57e462b5
> ("mm: hugetlb: clear target sub-page last when clearing huge page")
> is already 4.14 AFAICS, is it possible that this is the effect of this
> patch? Your patch is effectively disabling this optimization for most
> workloads that really care about it. I strongly doubt that hugetlb is a
> thing on 32b kernels these days. So this really begs for more data about
> the real underlying problem IMHO.
>
> --
> Michal Hocko
> SUSE Labs
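P.S. To make the comparison concrete, here is a minimal user-space
sketch. It is my own illustration, not the kernel code, and the absolute
numbers will not match the ftrace results above. It contrasts one
straight memset() of a 2MB region against clearing it 4KB at a time with
a compiler barrier after each sub-page, loosely mimicking the per-page
kmap_atomic()/kunmap_atomic() pattern:

-------------------------------------------------------
/*
 * Illustrative user-space sketch (not kernel code): compare clearing
 * a 2MB region with a single memset() against clearing it 4KB at a
 * time with a compiler barrier after each sub-page.
 *
 * Build: gcc -O2 -o clear_bench clear_bench.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* 2MB huge page */
#define SUBPAGE_SIZE	(4UL * 1024)		/* 4KB sub-page */

static double now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
	unsigned long off;
	double t0, t1;
	char *buf;

	if (posix_memalign((void **)&buf, HPAGE_SIZE, HPAGE_SIZE)) {
		perror("posix_memalign");
		return 1;
	}

	/* Fault the region in once so demand paging doesn't skew timing. */
	memset(buf, 1, HPAGE_SIZE);

	/* Strategy 1: one straight memset() over the whole 2MB. */
	t0 = now_us();
	memset(buf, 0, HPAGE_SIZE);
	t1 = now_us();
	printf("single memset    : %8.1f us\n", t1 - t0);

	/* Strategy 2: per-sub-page clearing with a barrier in between. */
	t0 = now_us();
	for (off = 0; off < HPAGE_SIZE; off += SUBPAGE_SIZE) {
		memset(buf + off, 0, SUBPAGE_SIZE);
		asm volatile("" ::: "memory");	/* compiler barrier */
	}
	t1 = now_us();
	printf("per-4KB clearing : %8.1f us\n", t1 - t0);

	free(buf);
	return 0;
}
-------------------------------------------------------

Note that this only models the barrier cost and the access pattern of
sequential clearing; it does not model the sub-page-last ordering of
c79b57e462b5, so it says nothing about the cache-hotness trade-off.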