> > Generally, many architectures are optimized for serial loads, be it
> > initialization or access, as it is the simplest form of prediction. Any
> > random access pattern would kill that pre-fetching. And for now, I
> > suspect that to be the case here. Probably, we can run more tests to
> > confirm this part.
>
> Please prove your theory with a test. Better to test x86 too.

Wrote the userspace test code below.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SZ_1M 0x100000
#define SZ_4K 0x1000
#define NUM 100

int main(void)
{
        char *p;
        char *q;
        char *r;
        unsigned long total_pages, total_size;
        int i, j;
        struct timeval t0, t1, t2, t3;
        int elapsed;

        printf("Hello World\n");

        total_size = NUM * SZ_1M;
        total_pages = NUM * (SZ_1M / SZ_4K);
        p = malloc(total_size);
        q = malloc(total_size);
        r = malloc(total_size);

        /* So that all pages get allocated */
        memset(r, 0xa, total_size);
        memset(q, 0xa, total_size);
        memset(p, 0xa, total_size);

        gettimeofday(&t0, NULL);

        /* One-shot memset of the whole buffer */
        memset(r, 0xd, total_size);

        gettimeofday(&t1, NULL);

        /* Traverse pages in forward order */
        for (j = 0; j < total_pages; j++)
                memset(q + (j * SZ_4K), 0xc, SZ_4K);

        gettimeofday(&t2, NULL);

        /* Traverse pages in reverse order */
        for (i = 0; i < total_pages; i++)
                memset(p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);

        gettimeofday(&t3, NULL);

        free(p);
        free(q);
        free(r);

        /* Report timings */
        elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
        printf("One shot: %d micro seconds\n", elapsed);

        elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
        printf("Forward order: %d micro seconds\n", elapsed);

        elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
        printf("Reverse order: %d micro seconds\n", elapsed);

        return 0;
}

------------------------------------------------------------------------

Results for ARM64 target (SM8150, CPU0 & 6 online, running at max
frequency). All numbers are the mean of 100 iterations; variation is
negligible.

 - One shot : 3389.26 us
 - Forward  : 8876.16 us
 - Reverse  : 18157.6 us

Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0
at max frequency). All numbers are the mean of 100 iterations; variation
is negligible.

 - One shot : 3203.49 us
 - Forward  : 5766.46 us
 - Reverse  : 5187.86 us

To conclude, serial (forward-order) writes are clearly faster on the ARM
processor. But strangely, memset in reverse order performs better than
forward order quite consistently across multiple x86 machines. I don't
have much insight into x86, so to clarify, I would like to restrict my
previous suspicion to ARM only.

> Best Regards,
> Huang, Ying
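
If we wanted to probe the "random access kills pre-fetching" part of the
theory directly, a separate program along the lines below could be run on
the same targets. This is only a sketch, not part of the test above: the
Fisher-Yates shuffle, the srand() seeding and the "Random order" label are
assumptions of mine; the buffer sizes and timing mirror the original test.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SZ_1M 0x100000
#define SZ_4K 0x1000
#define NUM 100

int main(void)
{
        unsigned long total_size = NUM * SZ_1M;
        unsigned long total_pages = NUM * (SZ_1M / SZ_4K);
        char *p = malloc(total_size);
        unsigned long *order = malloc(total_pages * sizeof(*order));
        unsigned long i, j, tmp;
        struct timeval t0, t1;
        long elapsed;

        /* Fault in all pages first, as in the original test */
        memset(p, 0xa, total_size);

        /* Build a page index list and shuffle it (Fisher-Yates) */
        for (i = 0; i < total_pages; i++)
                order[i] = i;
        srand(0);
        for (i = total_pages - 1; i > 0; i--) {
                j = rand() % (i + 1);
                tmp = order[i];
                order[i] = order[j];
                order[j] = tmp;
        }

        gettimeofday(&t0, NULL);

        /* Touch one page at a time in shuffled order */
        for (i = 0; i < total_pages; i++)
                memset(p + order[i] * SZ_4K, 0xe, SZ_4K);

        gettimeofday(&t1, NULL);

        elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
        printf("Random order: %ld micro seconds\n", elapsed);

        free(order);
        free(p);
        return 0;
}

If the prefetcher theory holds, the shuffled traversal should be no faster
than the reverse-order case on the ARM target.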