Re: [RFC] mm/memory.c: Optimizing THP zeroing routine for !HIGHMEM cases

On Sat, Apr 11, 2020 at 8:40 AM Chintan Pandya
<chintan.pandya@xxxxxxxxxxx> wrote:
>
> > > Generally, many architectures are optimized for serial loads, be it
> > > initialization or access, as it is simplest form of prediction. Any
> > > random access pattern would kill that pre-fetching. And for now, I
> > > suspect that to be the case here. Probably, we can run more tests to confirm
> > this part.
> >
> > Please prove your theory with test.  Better to test x86 too.
>
> Wrote down below userspace test code.
>
> Code:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/time.h>
>
>
> #define SZ_1M 0x100000
> #define SZ_4K 0x1000
> #define NUM 100
>
> int main ()
> {
>   void *p;
>   void *q;
>   void *r;
>
>   unsigned long total_pages, total_size;
>   int i, j;
>   struct timeval t0, t1, t2, t3;
>   int elapsed;
>
>   printf ("Hello World\n");
>
>   total_size = NUM * SZ_1M;
>   total_pages = NUM * (SZ_1M / SZ_4K);
>
>   p = malloc (total_size);
>   q = malloc (total_size);
>   r = malloc (total_size);
>
>   /* So that all pages get allocated */
>   memset (r, 0xa, total_size);
>   memset (q, 0xa, total_size);
>   memset (p, 0xa, total_size);
>
>   gettimeofday (&t0, NULL);
>
>   /* One shot memset */
>   memset (r, 0xd, total_size);
>
>   gettimeofday (&t1, NULL);
>
>   /* traverse in forward order */
>   for (j = 0; j < total_pages; j++)
>     {
>       memset (q + (j * SZ_4K), 0xc, SZ_4K);
>     }
>
>   gettimeofday (&t2, NULL);
>
>   /* traverse in reverse order */
>   for (i = 0; i < total_pages; i++)
>     {
>       memset (p + total_size - (i + 1) * SZ_4K, 0xb, SZ_4K);
>     }
>
>   gettimeofday (&t3, NULL);
>
>   free (p);
>   free (q);
>   free (r);
>
>   /* Results time */
>   elapsed = ((t1.tv_sec - t0.tv_sec) * 1000000) + (t1.tv_usec - t0.tv_usec);
>   printf ("One shot: %d micro seconds\n", elapsed);
>
>
>   elapsed = ((t2.tv_sec - t1.tv_sec) * 1000000) + (t2.tv_usec - t1.tv_usec);
>   printf ("Forward order: %d micro seconds\n", elapsed);
>
>
>   elapsed = ((t3.tv_sec - t2.tv_sec) * 1000000) + (t3.tv_usec - t2.tv_usec);
>   printf ("Reverse order: %d micro seconds\n", elapsed);
>   return 0;
> }
>
> ------------------------------------------------------------------------------------------------
>
> Results for ARM64 target (SM8150, CPUs 0 & 6 online, running at max frequency)
> All numbers are mean of 100 iterations. Variation is ignorable.
> - Oneshot : 3389.26 us
> - Forward : 8876.16 us
> - Reverse : 18157.6 us

This is an interesting data point. So running things in reverse seems
much more expensive than running them forward. As such I would imagine
process_huge_page is going to be significantly more expensive on
ARM64, then, since it winds through the pages in reverse order from the
end of the page all the way down to wherever the page was accessed.

I wonder if we couldn't simply modify process_huge_page to process
pages in two passes: the first from addr_hint plus some offset to the
end, then loop back around to the start of the page for the second
pass and process up to where the first pass started. The idea is that
the offset would be large enough that the 4K that was accessed, plus
some range before and after that address, is hopefully still in the
L1 cache after we are done.
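A rough userspace sketch of that two-pass idea (the function name,
hint_idx parameter, and the fixed offset are all illustrative
assumptions, not the kernel's process_huge_page API): clear forward
from a point slightly past the hinted sub-page to the end, then wrap
around and clear forward from the start up to that point, so the
lines around the hint are touched last and are hopefully still warm
in L1.

```c
#include <string.h>

#define SUBPAGE_SIZE 4096UL

/* Illustrative two-pass clear; names and the offset value are
 * assumptions, not the kernel interface. */
static void clear_huge_page_two_pass(char *base, unsigned long nr_subpages,
                                     unsigned long hint_idx)
{
    /* Start pass 1 a little past the hint so the region around the
     * accessed sub-page is cleared last (end of pass 2) and stays
     * in the L1 cache. The margin of 2 sub-pages is an assumption. */
    unsigned long start = hint_idx + 2;
    unsigned long i;

    if (start > nr_subpages)
        start = nr_subpages;

    /* Pass 1: forward from start to the end of the huge page. */
    for (i = start; i < nr_subpages; i++)
        memset(base + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);

    /* Pass 2: wrap around; forward from the beginning up to start,
     * finishing at the sub-pages nearest the access hint. */
    for (i = 0; i < start; i++)
        memset(base + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
}
```

Note both passes run forward, which would also line up with the
forward-order preference the ARM64 numbers above suggest.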

> Results for x86-64 (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz, only CPU 0 in max frequency)
> All numbers are mean of 100 iterations. Variation is ignorable.
> - Oneshot : 3203.49 us
> - Forward : 5766.46 us
> - Reverse : 5187.86 us
>
> To conclude, I observed optimized serial writes on the ARM processor. But strangely,
> memset in reverse order performs better than forward order quite consistently across
> multiple x86 machines. I don't have much insight into x86, so to be safe I would
> restrict my earlier suspicion to ARM only.

What compiler options did you build the test code with? One
possibility is that the compiler optimized
total_pages/total_size/i down into one variable and simply tracked it
until it dropped below 0. I know I regularly write loops to run in
reverse order for that reason, as it tends to perform pretty well on
x86: all you have to do is a sub or dec and then test the sign
flag to determine whether to exit the loop.
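For reference, a generic sketch of that countdown idiom (not taken
from Chintan's test): the decrement itself sets the flags the loop
branch tests, so the compiler can emit dec/sub + jns with no separate
cmp against the upper bound.

```c
/* Countdown loop: on x86 the --n updates the sign flag, which the
 * >= 0 test can branch on directly, avoiding an explicit compare
 * against the array length each iteration. */
long sum_reverse(const long *a, long n)
{
    long sum = 0;

    while (--n >= 0)
        sum += a[n];
    return sum;
}
```

Whether this actually explains the x86 forward/reverse gap would
depend on the optimization level used, which is why the compiler
flags matter.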

An additional thing I was wondering is whether this impacts the
copy operations as well. Looking through the code, the two big users
of process_huge_page are clear_huge_page and copy_user_huge_page. One
thing that might make more sense than just splitting the code at a
high level would be to look at refactoring process_huge_page
and its users.



