On Mon, Apr 13, 2020 at 8:34 AM Prathu Baronia
<prathu.baronia@xxxxxxxxxxx> wrote:
>
> The 04/11/2020 13:47, Alexander Duyck wrote:
> >
> > This is an interesting data point. So running things in reverse
> > seems much more expensive than running them forward. As such I
> > would imagine process_huge_page is going to be significantly more
> > expensive on ARM64 since it will wind through the pages in reverse
> > order from the end of the page all the way down to wherever the
> > page was accessed.
> >
> > I wonder if we couldn't simply change process_huge_page to process
> > pages in two passes? The first would run from addr_hint + some
> > offset to the end, and the second would loop back around to the
> > start of the page and process up to where the first pass started.
> > The idea is that the offset would be large enough that the 4K that
> > was accessed, plus some range before and after that address, is
> > hopefully still in the L1 cache after we are done.
> That's a great idea. We were already working on a similar idea for
> the v2 patch, and your suggestion has reassured us about our
> approach. This will incorporate the benefits of the optimized memset
> and will keep the cache hot around the faulting address.
>
> Earlier we had taken this offset as 0.5MB, and after your response we
> have set it to 32KB. Since we understand there is a trade-off
> associated with making this value too high, we would really
> appreciate it if you could suggest a method to derive an appropriate
> value for this offset from the L1 cache size.

I mentioned 32KB since that happens to be a common L1 cache size on
both the ARM64 processor mentioned and most modern x86 CPUs. As far as
deriving it goes, I don't know that there is a good way to do that; I
suspect it is something that would need to be architecture specific.
If nothing else you might be able to define it similarly to how
L1_CACHE_SHIFT/BYTES is defined in cache.h for most architectures.

We probably also want to play around with that value a bit, as I
suspect there is some room to either increase or decrease it depending
on the cost of cold accesses versus being able to process memory
initialization in larger batches.

> > An additional thing I was wondering is whether this also impacts
> > the copy operations. Looking through the code, the two big users of
> > process_huge_page are clear_huge_page and copy_user_huge_page. One
> > thing that might make more sense than just splitting the code at a
> > high level would be to look at refactoring process_huge_page and
> > its users.
> You are right; we hadn't considered refactoring process_huge_page
> earlier. We will incorporate this in the upcoming v2 patch.
>
> Thanks a lot for the interesting insights!

Sounds good. I'll look forward to v2.

Thanks.

- Alex
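
For readers following the thread, here is a minimal userspace sketch of
the two-pass clearing order discussed above. It is only an illustration
of the idea, not the patch itself: the function and constant names
(clear_huge_page_two_pass, KEEP_WINDOW) are made up for this example,
the sizes are assumptions, and the actual kernel change would walk
struct page entries inside process_huge_page()/clear_huge_page() using
clear_user_highpage() rather than memset() over a flat buffer.

#include <stdlib.h>
#include <string.h>
#include <stddef.h>

#define HPAGE_SIZE   (2UL * 1024 * 1024)  /* 2MB huge page (assumed) */
#define SUBPAGE_SIZE 4096UL               /* 4KB base page */
#define KEEP_WINDOW  (32UL * 1024)        /* region to keep hot; could be
                                             derived from an arch constant
                                             along the lines of
                                             L1_CACHE_SHIFT/BYTES in cache.h */

static void clear_huge_page_two_pass(char *page, size_t fault_off)
{
	/* Round the fault offset down to a 4K boundary. */
	fault_off &= ~(SUBPAGE_SIZE - 1);

	/* Pass 1 starts just past the window we want to keep hot. */
	size_t start = fault_off + KEEP_WINDOW;

	if (start > HPAGE_SIZE)
		start = HPAGE_SIZE;

	/* Pass 1: clear from (fault + window) to the end of the huge page. */
	for (size_t off = start; off < HPAGE_SIZE; off += SUBPAGE_SIZE)
		memset(page + off, 0, SUBPAGE_SIZE);

	/*
	 * Pass 2: wrap around and clear from the beginning of the huge
	 * page up to where pass 1 started.  The last subpages touched are
	 * the faulting 4K page and the window after it, so that region is
	 * the most likely to still be in L1 when the fault returns.
	 */
	for (size_t off = 0; off < start; off += SUBPAGE_SIZE)
		memset(page + off, 0, SUBPAGE_SIZE);
}

int main(void)
{
	char *page = aligned_alloc(HPAGE_SIZE, HPAGE_SIZE);

	if (!page)
		return 1;
	/* Pretend the fault hit 1.5MB into the 2MB huge page. */
	clear_huge_page_two_pass(page, 3 * HPAGE_SIZE / 4);
	free(page);
	return 0;
}

The clamp of "start" at the end of the page is a simplification; a real
implementation would have to decide how to handle a fault landing within
KEEP_WINDOW of the end of the huge page.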