On Mon, Apr 13, 2020 at 8:34 AM Prathu Baronia
<prathu.baronia@xxxxxxxxxxx> wrote:
>
> The 04/11/2020 13:47, Alexander Duyck wrote:
> >
> > This is an interesting data point. So running things in reverse
> > seems much more expensive than running them forward. As such I
> > would imagine process_huge_page is going to be significantly more
> > expensive on ARM64 since it will wind through the pages in reverse
> > order from the end of the page all the way down to wherever the
> > page was accessed.
> >
> > I wonder if we couldn't simply change process_huge_page to process
> > pages in two passes? The first would run from addr_hint + some
> > offset to the end, and the second would loop back around to the
> > start of the page and process up to where the first pass started.
> > The idea is that the offset would be large enough that the 4K that
> > was accessed, plus some range before and after that address, is
> > hopefully still in the L1 cache after we are done.
> That's a great idea. We were already working on a similar idea for
> the v2 patch, and your suggestion has reassured us about our
> approach. This will incorporate the benefits of the optimized memset
> and will keep the cache hot around the faulting address.
>
> Earlier we had taken this offset as 0.5MB, and after your response we
> have set it to 32KB. Since we understand there is a trade-off
> associated with making this value too high, we would really
> appreciate it if you could suggest a method to derive an appropriate
> value for this offset from the L1 cache size.

I mentioned 32KB since that happens to be a common L1 cache size on
both the ARM64 processor mentioned and most modern x86 CPUs. As far as
deriving it goes, I don't know that there is a good way to do that; I
suspect it is something that would need to be architecture specific.
If nothing else you might be able to define it similarly to how
L1_CACHE_SHIFT/BYTES is defined in cache.h for most architectures.

We probably also want to play around with that value a bit, as I
suspect there is some room to either increase or decrease it depending
on the cost of cold accesses versus being able to process memory
initialization in larger batches.

> > An additional thing I was wondering is whether this also impacts
> > the copy operations. Looking through the code, the two big users of
> > process_huge_page are clear_huge_page and copy_user_huge_page. One
> > thing that might make more sense than just splitting the code at a
> > high level would be to look at refactoring process_huge_page and
> > its users.
> You are right; we hadn't considered refactoring process_huge_page
> earlier. We will incorporate this in the upcoming v2 patch.
>
> Thanks a lot for the interesting insights!

Sounds good. I'll look forward to v2.

Thanks.

- Alex
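
For readers following the thread, here is a minimal userspace sketch of
the two-pass clearing order discussed above. It is only an illustration
of the idea, not the patch itself: the function and constant names
(clear_huge_page_two_pass, KEEP_WINDOW) are made up for this example,
the sizes are assumptions, and the actual kernel change would walk
struct page entries inside process_huge_page()/clear_huge_page() using
clear_user_highpage() rather than memset() over a flat buffer.

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define HPAGE_SIZE   (2UL * 1024 * 1024)  /* 2MB huge page (assumed) */
#define SUBPAGE_SIZE 4096UL               /* 4KB base page */
#define KEEP_WINDOW  (32UL * 1024)        /* region to keep hot; could be
                                             derived from an arch constant
                                             along the lines of
                                             L1_CACHE_SHIFT/BYTES in cache.h */

static void clear_huge_page_two_pass(char *page, size_t fault_off)
{
	/* Round the fault offset down to a 4K boundary. */
	fault_off &= ~(SUBPAGE_SIZE - 1);

	/* Pass 1 starts just past the window we want to keep hot. */
	size_t start = fault_off + KEEP_WINDOW;

	if (start > HPAGE_SIZE)
		start = HPAGE_SIZE;

	/* Pass 1: clear from (fault + window) to the end of the huge page. */
	for (size_t off = start; off < HPAGE_SIZE; off += SUBPAGE_SIZE)
		memset(page + off, 0, SUBPAGE_SIZE);

	/*
	 * Pass 2: wrap around and clear from the beginning of the huge
	 * page up to where pass 1 started.  The last subpages touched are
	 * the faulting 4K page and the window after it, so that region is
	 * the most likely to still be in L1 when the fault returns.
	 */
	for (size_t off = 0; off < start; off += SUBPAGE_SIZE)
		memset(page + off, 0, SUBPAGE_SIZE);
}

int main(void)
{
	char *page = aligned_alloc(HPAGE_SIZE, HPAGE_SIZE);

	if (!page)
		return 1;
	/* Pretend the fault hit 1.5MB into the 2MB huge page. */
	clear_huge_page_two_pass(page, 3 * HPAGE_SIZE / 4);
	free(page);
	return 0;
}

The clamp of "start" at the end of the page is a simplification; a real
implementation would have to decide how to handle a fault landing within
KEEP_WINDOW of the end of the huge page.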