On Thu 09-04-20 20:59:14, Prathu Baronia wrote:
> Following your response, I tried to find out the real benefit of removing
> the effective barrier() calls. To find that out, I wrote a simple diff
> (exp-v2), shown below, on top of the base code:
>
> -------------------------------------------------------
>  include/linux/highmem.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index b471a88..df908b4 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -145,9 +145,8 @@ do {						\
>
>  #ifndef clear_user_highpage
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
> -	void *addr = kmap_atomic(page);
> +	void *addr = page_address(page);
>  	clear_user_page(addr, vaddr, page);
> -	kunmap_atomic(addr);
>  }
>  #endif
> -------------------------------------------------------
>
> For consistency, I kept the CPU, DDR and cache on the performance governor.
> The target used is Qualcomm's SM8150 with kernel 4.14.117. On this platform,
> CPU0 is a Cortex-A55 and CPU6 is a Cortex-A76.
>
> The result of profiling clear_huge_page() is as follows:
> -------------------------------------------------------
> Ftrace results (times in microseconds):
> -------------------------------------------------------
> - Base:
>   - CPU0:
>     - Sample size : 95
>     - Mean        : 237.383
>     - Std dev     : 31.288
>   - CPU6:
>     - Sample size : 61
>     - Mean        : 258.065
>     - Std dev     : 19.97
> -------------------------------------------------------
> - v1 (original submission):
>   - CPU0:
>     - Sample size : 80
>     - Mean        : 112.298
>     - Std dev     : 0.36
>   - CPU6:
>     - Sample size : 83
>     - Mean        : 71.238
>     - Std dev     : 13.7819
> -------------------------------------------------------
> - exp-v2 (experimental diff mentioned above):
>   - CPU0:
>     - Sample size : 69
>     - Mean        : 218.911
>     - Std dev     : 54.306
>   - CPU6:
>     - Sample size : 101
>     - Mean        : 241.522
>     - Std dev     : 19.3068
> -------------------------------------------------------
>
> - Comparing base vs exp-v2: simply removing the barriers from the
>   kmap_atomic() code doesn't improve the results significantly.

Unless I am misreading those numbers, barrier() doesn't change anything
because the differences are within the noise. So the difference is indeed
caused by the more clever initialization that keeps the faulted address
cache hot.

Could you be more specific about how you noticed the slowdown? I mean, is
there a real world workload for which you have observed a regression and
narrowed it down to the zeroing? I do realize that the initialization
improvement patch doesn't really mention any real life usecase either. It
is based on a microbenchmark, but the objective sounds reasonable. If it
regresses some other workloads then we either have to make it conditional
or find out what is causing the regression and how much that regression
actually matters.
-- 
Michal Hocko
SUSE Labs
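
[Editorial note: the "more clever initialization" above refers to zeroing a
huge page so that the sub-page containing the faulting address is cleared
last, leaving its cache lines hot for the access that triggered the fault.
The sketch below is a simplified illustration of that ordering only, not the
kernel's actual clear_huge_page()/process_huge_page() implementation; the
function name, sizes, and the plain memset() calls are assumptions for the
example, and a 2MB huge page of 4KB sub-pages is assumed.]

/*
 * Illustrative sketch: zero every sub-page of a huge page, saving the
 * sub-page that contains the faulting offset for last so it stays cache hot.
 */
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE		4096UL
#define SUBPAGES_PER_HUGEPAGE	512UL	/* assumed 2MB huge page */

static void clear_huge_page_target_last(void *hugepage,
					unsigned long faulting_offset)
{
	unsigned long target = faulting_offset / SUBPAGE_SIZE;
	unsigned long i;

	/* Zero every sub-page except the one being faulted on... */
	for (i = 0; i < SUBPAGES_PER_HUGEPAGE; i++) {
		if (i == target)
			continue;
		memset((char *)hugepage + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
	}

	/* ...then zero the faulted sub-page last so its lines remain cached. */
	memset((char *)hugepage + target * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

[The kernel goes further and walks from the ends of the huge page toward the
target sub-page, but the cache-locality argument in the thread is captured by
clearing the target last, which a single flat memset() of the whole huge page
gives up.]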