On Tue, Apr 14, 2020 at 11:47 AM Prathu Baronia <prathu.baronia@xxxxxxxxxxx> wrote:
>
> The 04/14/2020 19:03, Michal Hocko wrote:
> > I still have a hard time seeing why the kmap machinery should
> > introduce any slowdown here. Previous data posted while discussing v1
> > didn't really show anything outside of the noise.
> >
> You are right, the multiple barriers are not responsible for the
> slowdown, but removal of kmap_atomic() allows us to call memset and
> memcpy for larger sizes. I will reframe this part of the commit text
> when we proceed towards v3 to present it more cleanly.
>
> > It would be really nice to provide std
> >
> Here is the data with std:
> ----------------------------------------------------------------------
> Results:
> ----------------------------------------------------------------------
> Results for ARM64 target (SM8150, CPU0 & 6 are online, running at max
> frequency). All numbers are the mean of 100 iterations. Variation is
> negligible.
> ----------------------------------------------------------------------
> - Oneshot : 3389.26 us  std: 79.1377 us
> - Forward : 8876.16 us  std: 172.699 us
> - Reverse : 18157.6 us  std: 111.713 us
> ----------------------------------------------------------------------
>
> ----------------------------------------------------------------------
> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU0
> at max frequency, DDR also running at max frequency). All numbers are
> the mean of 100 iterations. Variation is negligible.
> ----------------------------------------------------------------------
> - Oneshot : 3203.49 us  std: 115.4086 us
> - Forward : 5766.46 us  std: 328.6299 us
> - Reverse : 5187.86 us  std: 341.1918 us
> ----------------------------------------------------------------------
>
> > No. There is absolutely zero reason to add a config option for this.
> > The kernel should have all the information to make an educated guess.
> >
> I will try to incorporate this in v3.
> But currently I don't have any idea of how to go about implementing
> the guessing logic. I would really appreciate it if you could suggest
> a way to go about it.
>
> > Also, before going any further: the patch which introduced the
> > optimization was c79b57e462b5 ("mm: hugetlb: clear target sub-page
> > last when clearing huge page"). It is based on an artificial
> > benchmark which, to my knowledge, doesn't represent any real
> > workload. Your measurements are based on a different benchmark. Your
> > numbers clearly show that some assumptions used for the optimization
> > are not architecture neutral.
> >
> But the oneshot numbers are significantly better on both
> architectures. I think the oneshot approach should theoretically
> provide better results on all architectures compared with the serial
> approach. Isn't it a fair assumption to go ahead with the oneshot
> approach?

I think the point that Michal is getting at is that there are other
tests that need to be run. You are running the test on just one core.
What happens as we start fanning this out and having multiple instances
running per socket? We would be flooding the LLC in addition to
overwriting all the other caches.

If you take a look at commit c6ddfb6c58903 ("mm, clear_huge_page: move
order algorithm into a separate function"), they were running the tests
on multiple threads simultaneously, as their concern was flooding the
LLC.

I wonder if we couldn't look at bypassing the cache entirely using
something like __copy_user_nocache for some portion of the copy, and
then only copy in the last pieces that we think will be immediately
accessed.
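For reference, the two clearing strategies being compared can be
sketched in plain userspace C. This is only a loose illustration of the
shapes involved: the sizes assume a 2 MiB huge page with 4 KiB base
pages, kmap_atomic() has no userspace equivalent so it is omitted, and
clear_serial()/clear_oneshot() are hypothetical names for this sketch,
not the kernel's functions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define HPAGE_SIZE   (2UL << 20)	/* 2 MiB huge page */
#define SUBPAGE_SIZE 4096UL		/* 4 KiB base page */

/* Serial approach: clear one 4 KiB sub-page at a time, roughly the
 * shape of the per-sub-page loop in the kernel's clear_huge_page(). */
static void clear_serial(void *page)
{
	size_t off;

	for (off = 0; off < HPAGE_SIZE; off += SUBPAGE_SIZE)
		memset((char *)page + off, 0, SUBPAGE_SIZE);
}

/* Oneshot approach: a single memset over the whole mapped huge page,
 * letting the memset implementation use the widest stores it has. */
static void clear_oneshot(void *page)
{
	memset(page, 0, HPAGE_SIZE);
}
```

Both produce an all-zero page; the benchmark numbers above suggest the
difference lies purely in how much work the store path can batch per
call.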
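__copy_user_nocache() itself is an x86-internal kernel helper, but the
idea — stream the bulk of the page past the cache hierarchy and copy
only the part expected to be touched next through ordinary cached
stores — can be illustrated in userspace with SSE2 non-temporal store
intrinsics. This is a sketch under that assumption; the function name
copy_page_nocache_tail and the hot_tail parameter are invented for the
example, and it requires a 16-byte-aligned destination on x86:

```c
#include <assert.h>
#include <emmintrin.h>	/* SSE2: _mm_stream_si128, _mm_loadu_si128 */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Copy size bytes from src to dst, streaming the first (size - hot_tail)
 * bytes with non-temporal stores that bypass the cache, then copying the
 * final hot_tail bytes with ordinary cached stores so the data the
 * faulting thread touches next is warm. dst must be 16-byte aligned and
 * (size - hot_tail) a multiple of 16. */
static void copy_page_nocache_tail(void *dst, const void *src,
				   size_t size, size_t hot_tail)
{
	size_t bulk = size - hot_tail;
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	size_t i;

	for (i = 0; i < bulk / sizeof(__m128i); i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();	/* order streaming stores before the cached copy */

	memcpy((char *)dst + bulk, (const char *)src + bulk, hot_tail);
}
```

The streamed portion never lands in the LLC, which is what would keep
many concurrent copies from evicting each other's working sets in the
multi-instance scenario described above.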