> On 4/21/20 3:39 PM, Will Deacon wrote:
> > On Tue, Apr 21, 2020 at 02:48:04PM +0200, Vlastimil Babka wrote:
> >> On 4/21/20 2:47 PM, Vlastimil Babka wrote:
> >> >
> >> > It was suspected that current Intel can prefetch forward and
> >> > backwards, and the tested ARM64 microarchitecture only backwards,
> >> > can it be true? The current code
> >>
> >> Oops, tested ARM64 microarchitecture I meant "only forwards".
> >
> > I'd be surprised if that's the case, but it could be that there's an
> > erratum workaround in play which hampers the prefetch behaviour. We
> > generally try not to assume too much about the prefetcher on arm64
> > because they're not well documented and vary wildly between different
> > micro-architectures.
>
> Yeah it's probably not as simple as I thought, as the test code [1] shows the
> page iteration goes backwards, but per-page memsets are not special. So
> maybe it's not hardware specifics, but x86 memtest implementation is also
> done backwards, so it fits the backwards outer loop, but arm64 memset is
> forward, so the resulting pattern is non-linear?
>
> In that case it's also a question if the measurement was done in kernel or
> userspace, and if userspace memset have any implications for kernel memset...
>

Yes, Prathu eventually reran his userspace test with the memset
implementation copied from the kernel and shared the results. That run also
showed poor performance in the backward direction on ARM.

In addition, he profiled clear_huge_page() using ftrace, and two different
cores of the SM8150 show an improvement with the forward memset approach
(Prathu's v2 patch).
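To make the access pattern under discussion concrete, here is a minimal
userspace sketch of the two loop orders. It is not the kernel's
clear_huge_page() and not Prathu's test code: the function names, the 4 KiB
page size and the 512-page count are assumptions chosen for illustration.
It only shows that with a backward outer loop and a forward per-page memset,
the write stream jumps back by two pages at every page boundary, while the
forward outer loop gives one contiguous ascending stream.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed sizes for illustration: 4 KiB pages, 512 pages (2 MiB total). */
#define SKETCH_PAGE_SIZE 4096
#define SKETCH_NR_PAGES  512

/*
 * Backward outer loop: pages are visited from last to first, but each
 * memset() still writes its own page from low to high addresses.
 */
static void clear_pages_backward(unsigned char *buf, size_t nr_pages)
{
	for (size_t i = nr_pages; i-- > 0; )
		memset(buf + i * SKETCH_PAGE_SIZE, 0, SKETCH_PAGE_SIZE);
}

/* Forward outer loop: the whole buffer is written in ascending order. */
static void clear_pages_forward(unsigned char *buf, size_t nr_pages)
{
	for (size_t i = 0; i < nr_pages; i++)
		memset(buf + i * SKETCH_PAGE_SIZE, 0, SKETCH_PAGE_SIZE);
}

/*
 * Byte offset between the end of one page's forward memset and the start
 * of the next page visited.
 * Forward order:  page i ends at (i + 1) * PAGE, page i + 1 starts there.
 * Backward order: page i ends at (i + 1) * PAGE, page i - 1 starts at
 *                 (i - 1) * PAGE, i.e. two pages lower.
 */
static long jump_between_pages(int backward)
{
	return backward ? -2L * SKETCH_PAGE_SIZE : 0L;
}
```

Whether that two-page backward jump is what hurts the tested ARM64 cores is
exactly the open question in this thread; the sketch only illustrates the
pattern, it does not measure anything.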
----------------------------------------------------------------------
Ftrace Results (clear_huge_page()):
----------------------------------------------------------------------
All timing values are in microseconds (us)
----------------------------------------------------------------------
Base:
	- CPU0:
		- Samples: 95
		- Mean: 242.099 us
		- Std dev: 45.0096 us
	- CPU6:
		- Samples: 61
		- Mean: 258.372 us
		- Std dev: 22.0754 us
----------------------------------------------------------------------
v2:
	- CPU0:
		- Samples: 63
		- Mean: 112.297 us
		- Std dev: 0.310989 us
	- CPU6:
		- Samples: 99
		- Mean: 67.359 us
		- Std dev: 1.15997 us
----------------------------------------------------------------------

> [1] https://lore.kernel.org/linux-mm/20200414153829.GA15230@oneplus.com/
>
> > Will
> >
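For reference, the speedup implied by the ftrace means above can be worked
out with a few lines of C. The struct and function names are hypothetical;
the numbers are copied from the results table and nothing else is assumed.
The ratio of means comes out to roughly 2.2x on CPU0 and 3.8x on CPU6, and
the relative standard deviation drops from about 19% (CPU0 base) to well
under 1% (CPU0 v2).

```c
#include <assert.h>

/* One row per CPU: mean and standard deviation in microseconds. */
struct ftrace_row {
	const char *cpu;
	double base_mean_us;
	double base_stddev_us;
	double v2_mean_us;
	double v2_stddev_us;
};

/* Values taken from the ftrace results for clear_huge_page(). */
static const struct ftrace_row ftrace_rows[] = {
	{ "CPU0", 242.099, 45.0096, 112.297, 0.310989 },
	{ "CPU6", 258.372, 22.0754,  67.359, 1.15997  },
};

/* Speedup of v2 over base, as a ratio of mean durations. */
static double speedup(const struct ftrace_row *r)
{
	return r->base_mean_us / r->v2_mean_us;
}

/* Relative standard deviation (stddev / mean) of a measurement. */
static double rel_stddev(double stddev_us, double mean_us)
{
	return stddev_us / mean_us;
}
```

Note that the sample counts differ between base and v2 (95 vs 63 on CPU0,
61 vs 99 on CPU6), so this is a comparison of means only, not a paired test.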