On Thu 09-04-20 20:59:14, Prathu Baronia wrote:
> Following your response, I tried to find out the real benefit of removing
> the effective barrier() calls. To find that out, I wrote a simple diff
> (exp-v2), shown below, on top of the base code:
>
> -------------------------------------------------------
>  include/linux/highmem.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index b471a88..df908b4 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -145,9 +145,8 @@ do {						\
>
>  #ifndef clear_user_highpage
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
> -	void *addr = kmap_atomic(page);
> +	void *addr = page_address(page);
>  	clear_user_page(addr, vaddr, page);
> -	kunmap_atomic(addr);
>  }
>  #endif
> -------------------------------------------------------
>
> For consistency, I kept the CPU, DDR and cache on the performance governor.
> The target used is Qualcomm's SM8150 with kernel 4.14.117. On this platform,
> CPU0 is a Cortex-A55 and CPU6 is a Cortex-A76.
>
> The result of profiling clear_huge_page() is as follows:
> -------------------------------------------------------
> Ftrace results (times in microseconds):
> -------------------------------------------------------
> - Base:
>   - CPU0:
>     - Sample size : 95
>     - Mean        : 237.383
>     - Std dev     : 31.288
>   - CPU6:
>     - Sample size : 61
>     - Mean        : 258.065
>     - Std dev     : 19.97
> -------------------------------------------------------
> - v1 (original submission):
>   - CPU0:
>     - Sample size : 80
>     - Mean        : 112.298
>     - Std dev     : 0.36
>   - CPU6:
>     - Sample size : 83
>     - Mean        : 71.238
>     - Std dev     : 13.7819
> -------------------------------------------------------
> - exp-v2 (experimental diff mentioned above):
>   - CPU0:
>     - Sample size : 69
>     - Mean        : 218.911
>     - Std dev     : 54.306
>   - CPU6:
>     - Sample size : 101
>     - Mean        : 241.522
>     - Std dev     : 19.3068
> -------------------------------------------------------
>
> - Comparing base vs exp-v2: simply removing the barriers from the
>   kmap_atomic() code doesn't improve the results significantly.

Unless I am misreading those numbers, barrier() doesn't change anything
because the differences are within the noise. So the difference is indeed
caused by the more clever initialization that keeps the faulted address
cache hot.

Could you be more specific about how you noticed the slowdown? I mean, is
there a real world workload for which you have observed a regression and
narrowed it down to the zeroing? I do realize that the initialization
improvement patch doesn't really mention any real life usecase either. It
is based on a microbenchmark, but the objective sounds reasonable. If it
regresses some other workloads then we either have to make it conditional
or find out what is causing the regression and how much that regression
actually matters.
-- 
Michal Hocko
SUSE Labs
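
[Editorial note: the "more clever initialization" above refers to zeroing a
huge page so that the sub-page containing the faulting address is cleared
last, leaving its cache lines hot for the access that triggered the fault.
The sketch below is a simplified illustration of that ordering only, not the
kernel's actual clear_huge_page()/process_huge_page() implementation; the
function name, sizes, and the plain memset() calls are assumptions for the
example, and a 2MB huge page of 4KB sub-pages is assumed.]

/*
 * Illustrative sketch: zero every sub-page of a huge page, saving the
 * sub-page that contains the faulting offset for last so it stays cache hot.
 */
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE		4096UL
#define SUBPAGES_PER_HUGEPAGE	512UL	/* assumed 2MB huge page */

static void clear_huge_page_target_last(void *hugepage,
					unsigned long faulting_offset)
{
	unsigned long target = faulting_offset / SUBPAGE_SIZE;
	unsigned long i;

	/* Zero every sub-page except the one being faulted on... */
	for (i = 0; i < SUBPAGES_PER_HUGEPAGE; i++) {
		if (i == target)
			continue;
		memset((char *)hugepage + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
	}

	/* ...then zero the faulted sub-page last so its lines remain cached. */
	memset((char *)hugepage + target * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}

[The kernel goes further and walks from the ends of the huge page toward the
target sub-page, but the cache-locality argument in the thread is captured by
clearing the target last, which a single flat memset() of the whole huge page
gives up.]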