On Wed, Apr 22, 2020 at 09:18:52AM +0100, Will Deacon wrote: > On Tue, Apr 21, 2020 at 03:48:04PM +0200, Vlastimil Babka wrote: > > On 4/21/20 3:39 PM, Will Deacon wrote: > > > On Tue, Apr 21, 2020 at 02:48:04PM +0200, Vlastimil Babka wrote: > > >> On 4/21/20 2:47 PM, Vlastimil Babka wrote: > > >> > > > >> > It was suspected that current Intel can prefetch forward and backwards, and the > > >> > tested ARM64 microarchitecture only backwards, can it be true? The current code > > >> > > >> Oops, tested ARM64 microarchitecture I meant "only forwards". > > > > > > I'd be surprised if that's the case, but it could be that there's an erratum > > > workaround in play which hampers the prefetch behaviour. We generally try > > > not to assume too much about the prefetcher on arm64 because they're not > > > well documented and vary wildly between different micro-architectures. > > > > Yeah it's probably not as simple as I thought, as the test code [1] shows the > > page iteration goes backwards, but per-page memsets are not special. So maybe > > it's not hardware specifics, but x86 memtest implementation is also done > > backwards, so it fits the backwards outer loop, but arm64 memset is forward, so > > the resulting pattern is non-linear? > > A straightforward linear prefetcher would probably be defeated by that sort > of thing, yes, but I'd have thought that the recent CPUs (e.g. A76 which I > think is the "big" CPU in the SoC mentioned at the start of the thread) > would still have a fighting chance at prefetching based on non-linear > histories. > > However, to my earlier point, we're making this more difficult than it needs > to be for the hardware and we shouldn't assume that all prefetchers will > handle it gracefully, so keeping the core code relatively straightforward > does seem to be the best bet. Alarm bells just rang initially when it > appeared that we were optimising code under arch/arm64 rather than improving > the core code, but I now have a better picture of what's going on (thanks). > > Alternatively, we could switch our memset() around, but I'm worried that > we could end up hurting something else by doing that. I guess we could add a > memset_backwards() version if we *had* to... > > > In that case it's also a question if the measurement was done in kernel or > > userspace, and if userspace memset have any implications for kernel memset... > > Sounds like it was done in userspace. If I get a chance later on, I'll try > to give it a spin here on some of the boards I have kicking around. I wrote the silly harness below for the snippets given in [1] but I can't see any difference between the forwards and backwards versions on any arm64 systems I have access to. Will [1] https://lore.kernel.org/linux-mm/20200414153829.GA15230@xxxxxxxxxxx/ --->8 #if 0 /* One shot memset */ memset (r, 0xd, total_size); /* traverse in forward order */ for (j = 0; j < total_pages; j++) { memset (q + (j * SZ_4K), 0xc, SZ_4K); } /* traverse in reverse order */ for (i = 0; i < total_pages; i++) { memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K); } #endif #include <errno.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h> #include <unistd.h> #define BUF_SZ (1UL*1024*1024*1024) #define PAGE_SIZE 0x1000 #define BUF_SZ_PAGES (BUF_SZ / PAGE_SIZE) #define NSECS_PER_SEC 1000000000ULL static void do_the_thing_forwards(void *buf) { unsigned long i; for (i = 0; i < BUF_SZ_PAGES; i++) memset(buf + (i * PAGE_SIZE), 0xc, PAGE_SIZE); } static void do_the_thing_backwards(void *buf) { unsigned long i; for (i = 0; i < BUF_SZ_PAGES; i++) memset(buf + BUF_SZ - (i + 1) * PAGE_SIZE, 0xc, PAGE_SIZE); } int main(void) { void *buf; unsigned long long delta; struct timespec ts_start, ts_end; if (posix_memalign(&buf, PAGE_SIZE, BUF_SZ)) { perror("posix_memalign()"); return -1; } memset(buf, 0xd, BUF_SZ); if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start)) { perror("clock_gettime()"); return -1; } do_the_thing_forwards(buf); do_the_thing_forwards(buf); do_the_thing_forwards(buf); do_the_thing_forwards(buf); do_the_thing_forwards(buf); if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end)) { perror("clock_gettime()"); return -1; } delta = NSECS_PER_SEC * (ts_end.tv_sec - ts_start.tv_sec); delta += (ts_end.tv_nsec - ts_start.tv_nsec); printf("Forwards: took %f seconds\n", (double)(delta / (double)NSECS_PER_SEC)); if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_start)) { perror("clock_gettime()"); return -1; } do_the_thing_backwards(buf); do_the_thing_backwards(buf); do_the_thing_backwards(buf); do_the_thing_backwards(buf); do_the_thing_backwards(buf); if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts_end)) { perror("clock_gettime()"); return -1; } delta = NSECS_PER_SEC * (ts_end.tv_sec - ts_start.tv_sec); delta += (ts_end.tv_nsec - ts_start.tv_nsec); printf("Backwards: took %f seconds\n", (double)(delta / (double)NSECS_PER_SEC)); return 0; }