Following your response, I tried to find out the real benefit of removing
the effective barrier() calls. To find that out, I wrote a simple diff
(exp-v2), shown below, on top of the base code:

-------------------------------------------------------
 include/linux/highmem.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index b471a88..df908b4 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -145,9 +145,8 @@ do { \
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
-	void *addr = kmap_atomic(page);
+	void *addr = page_address(page);
 	clear_user_page(addr, vaddr, page);
-	kunmap_atomic(addr);
 }
 #endif
-------------------------------------------------------

For consistency, I kept the CPU, DDR and cache on the performance
governor. The target used is Qualcomm's SM8150 with kernel 4.14.117. On
this platform, CPU0 is a Cortex-A55 and CPU6 is a Cortex-A76.

The profiling results for clear_huge_page() are as follows:

-------------------------------------------------------
Ftrace results: times are in microseconds.
-------------------------------------------------------
- Base:
  - CPU0:
    - Sample size : 95
    - Mean        : 237.383
    - Std dev     : 31.288
  - CPU6:
    - Sample size : 61
    - Mean        : 258.065
    - Std dev     : 19.97
-------------------------------------------------------
- v1 (original submission):
  - CPU0:
    - Sample size : 80
    - Mean        : 112.298
    - Std dev     : 0.36
  - CPU6:
    - Sample size : 83
    - Mean        : 71.238
    - Std dev     : 13.7819
-------------------------------------------------------
- exp-v2 (experimental diff mentioned above):
  - CPU0:
    - Sample size : 69
    - Mean        : 218.911
    - Std dev     : 54.306
  - CPU6:
    - Sample size : 101
    - Mean        : 241.522
    - Std dev     : 19.3068
-------------------------------------------------------

- Comparing base vs exp-v2: simply removing the barriers from the
  kmap_atomic() code doesn't improve the results significantly.

- Comparing v1 vs exp-v2: a straight memset(0) of the 2MB page is
  significantly faster than zeroing the individual sub-pages.

- Analysing base and exp-v2: CPU6 was expected to outperform CPU0, but
  the zeroing pattern is adversarial for CPU6, so it ends up performing
  worse. Under a sequential load, by contrast, CPU6 truly outperforms
  CPU0.

Based on the above three points, it looks like a straight memset(0)
indeed improves execution time, primarily because its predictable access
pattern suits most CPU architectures out there. (A small user-space
sketch that illustrates this contrast is appended at the end of this
mail.)

Having said that, I also understand that v1 loses out on the
optimization made by c79b57e462b5 ("mm: hugetlb: clear target sub-page
last when clearing huge page"), which keeps the caches hot around the
faulting address. If keeping the caches hot around the faulting address
is that important (only numbers can prove it, and I don't have the
insight to get those numbers), it might be better to develop on top of
v1 than to not use v1 at all.

The 04/03/2020 10:52, Michal Hocko wrote:
>
> This is an old kernel. Do you see the same with the current upstream
> kernel? Btw. 60% improvement only from dropping barrier sounds
> unexpected to me. Are you sure this is the only reason? c79b57e462b5
> ("mm: hugetlb: clear target sub-page last when clearing huge page")
> is already 4.14 AFAICS, is it possible that this is the effect of this
> patch? Your patch is effectively disabling this optimization for most
> workloads that really care about it. I strongly doubt that hugetlb is a
> thing on 32b kernels these days. So this really begs for more data about
> the real underlying problem IMHO.
>
> --
> Michal Hocko
> SUSE Labs
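P.S. To make the comparison concrete, here is a minimal user-space
sketch. It is my own illustration, not the kernel code, and the absolute
numbers will not match the ftrace results above. It contrasts one
straight memset() of a 2MB region against clearing it 4KB at a time with
a compiler barrier after each sub-page, loosely mimicking the per-page
kmap_atomic()/kunmap_atomic() pattern:

-------------------------------------------------------
/*
 * Illustrative user-space sketch (not kernel code): compare clearing
 * a 2MB region with a single memset() against clearing it 4KB at a
 * time with a compiler barrier after each sub-page.
 *
 * Build: gcc -O2 -o clear_bench clear_bench.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* 2MB huge page */
#define SUBPAGE_SIZE	(4UL * 1024)		/* 4KB sub-page */

static double now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
	unsigned long off;
	double t0, t1;
	char *buf;

	if (posix_memalign((void **)&buf, HPAGE_SIZE, HPAGE_SIZE)) {
		perror("posix_memalign");
		return 1;
	}

	/* Fault the region in once so demand paging doesn't skew timing. */
	memset(buf, 1, HPAGE_SIZE);

	/* Strategy 1: one straight memset() over the whole 2MB. */
	t0 = now_us();
	memset(buf, 0, HPAGE_SIZE);
	t1 = now_us();
	printf("single memset    : %8.1f us\n", t1 - t0);

	/* Strategy 2: per-sub-page clearing with a barrier in between. */
	t0 = now_us();
	for (off = 0; off < HPAGE_SIZE; off += SUBPAGE_SIZE) {
		memset(buf + off, 0, SUBPAGE_SIZE);
		asm volatile("" ::: "memory");	/* compiler barrier */
	}
	t1 = now_us();
	printf("per-4KB clearing : %8.1f us\n", t1 - t0);

	free(buf);
	return 0;
}
-------------------------------------------------------

Note that this only models the barrier cost and the access pattern of
sequential clearing; it does not model the sub-page-last ordering of
c79b57e462b5, so it says nothing about the cache-hotness trade-off.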