On Fri, May 01, 2020 at 02:28:55PM +0530, Prathu Baronia wrote:
> Platform and setup conditions:
> Qualcomm's SM8150 platform under controlled conditions (i.e. only CPU0 and
> CPU6 turned on and set to max frequency, and DDR set to the performance
> governor).
> ---------------------------------------------------------------------------
>
> ---------------------------------------------------------------------------
> Summary:
> We observed a ~61% improvement in execution time of clearing a hugepage
> on arm64 if we increase the granularity, i.e. the chunk size, from 4KB to
> 64KB for each chunk-clearing subroutine call.
> ---------------------------------------------------------------------------
>
> For the base build:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
>   - Samples: 95
>   - Mean: 242.099 us
>   - Std dev: 45.0096 us

That's one hell of a deviation. Any idea what's going on there?

> - CPU6:
>   - Samples: 61
>   - Mean: 258.372 us
>   - Std dev: 22.0754 us
>
> With patches [PATCH {1,2,3}/4] provided at the end, where we just revert the
> forward-reverse traversal code, we observed:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
>   - Samples: 77
>   - Mean: 234.568 us
>   - Std dev: 6.52 us
> - CPU6:
>   - Samples: 81
>   - Mean: 259.437 us
>   - Std dev: 19.25 us
>
> We were expecting a bit of an improvement in arm64's case because of our
> hypothesis that reverse traversal is considerably slower on arm64, but after
> Will Deacon's test code showed similar timings for forward and reverse
> traversals we dug a bit deeper into this.
>
> I found that in the case of arm64 a page is cleared using a special
> clear_page.S assembly routine instead of an explicit call to memset. With
> the below patch we bypassed the assembly routine and observed an improvement
> in the execution time of clear_huge_page() on CPU0.
>
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index ea5cdbd8c2c3..a0a97a95aee8 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -158,7 +158,7 @@ do {						\
>  static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
>  {
>  	void *addr = kmap_atomic(page);
> -	clear_user_page(addr, vaddr, page);
> +	memset(addr, 0x0, PAGE_SIZE);
>  	kunmap_atomic(addr);
>  }
>  #endif
>
> For reference I will call the above patch v-exp.
>
> When v-exp is applied on base we observed:
>
> clear_huge_page() ftrace profile
> --------------------------------
> - CPU0:
>   - Samples: 71
>   - Mean: 124.657 us
>   - Std dev: 0.494165 us

This doesn't make any sense to me. memset() of zero is special-cased to use
the DC ZVA instruction in a loop:

3:	dc	zva, dst
	add	dst, dst, zva_len_x
	subs	count, count, zva_len_x
	b.ge	3b

which is basically the same as clear_page():

1:	dc	zva, x0
	add	x0, x0, x1
	tst	x0, #(PAGE_SIZE - 1)
	b.ne	1b

Are you able to reproduce this in userspace?

Will
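
For anyone wanting to try the userspace reproduction Will asks about, a
minimal sketch might look like the following. It times libc memset() against
a hand-rolled clear_page()-style DC ZVA loop over a 2MB buffer. The
DCZID_EL0 decode follows the architecture, but the file name, buffer size,
iteration count, and use of CLOCK_MONOTONIC are illustrative assumptions,
and whether your libc's memset() actually hits a DC ZVA fast path for zero
depends on the libc. Not from the thread.

/* zva_test.c: build with gcc -O2 -o zva_test zva_test.c (arm64 only) */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <time.h>
#include <sys/mman.h>

#define PAGE_SIZE	4096UL
#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* one 2MB hugepage worth */
#define ITERS		100			/* arbitrary choice */

/* DCZID_EL0[3:0] is log2 of the ZVA block size in 4-byte words.
 * Assumes DC ZVA is not prohibited (DCZID_EL0.DZP == 0), as under Linux. */
static uint64_t zva_block_size(void)
{
	uint64_t dczid;

	asm volatile("mrs %0, dczid_el0" : "=r" (dczid));
	return 4UL << (dczid & 0xf);
}

/* clear_page()-style loop: DC ZVA until the next page boundary */
static void clear_page_zva(char *page, uint64_t zva_sz)
{
	do {
		asm volatile("dc zva, %0" : : "r" (page) : "memory");
		page += zva_sz;
	} while ((uintptr_t)page & (PAGE_SIZE - 1));
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

int main(void)
{
	char *buf = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	uint64_t zva_sz = zva_block_size();
	uint64_t t0, t1, off;
	int i;

	if (buf == MAP_FAILED)
		return 1;
	memset(buf, 0xff, HPAGE_SIZE);	/* fault everything in up front */

	t0 = now_ns();
	for (i = 0; i < ITERS; i++)
		for (off = 0; off < HPAGE_SIZE; off += PAGE_SIZE)
			memset(buf + off, 0, PAGE_SIZE);
	t1 = now_ns();
	asm volatile("" : : "r" (buf) : "memory");	/* keep the stores alive */
	printf("memset:      %lu ns/hugepage\n",
	       (unsigned long)((t1 - t0) / ITERS));

	t0 = now_ns();
	for (i = 0; i < ITERS; i++)
		for (off = 0; off < HPAGE_SIZE; off += PAGE_SIZE)
			clear_page_zva(buf + off, zva_sz);
	t1 = now_ns();
	asm volatile("" : : "r" (buf) : "memory");
	printf("dc zva loop: %lu ns/hugepage\n",
	       (unsigned long)((t1 - t0) / ITERS));

	return 0;
}

If the two zeroing loops really are equivalent, the timings should be within
noise of each other; a large gap on CPU0 would point at something other than
the zeroing instruction itself, e.g. the per-4KB call overhead discussed
above.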