Re: [PATCH v11 0/6] mm / virtio: Provide support for unused page reporting

Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx> · Mon, 07 Oct 2019 08:33:04 -0700

On Mon, 2019-10-07 at 08:29 -0400, Nitesh Narayan Lal wrote:
> On 10/2/19 10:25 AM, Alexander Duyck wrote:
> 
> [...]
> > > > My suggestion would be to look at reworking the patch set and
> > > > post numbers for my patch set versus the bitmap approach and we can
> > > > look at them then.
> > > Agreed. However, in order to fix an issue I have to reproduce it first.
> > With the tweak I have suggested above it should make it much easier to
> > reproduce. Basically all you need is to have the allocation competing
> > against hinting. Currently the hinting isn't doing this because the
> > allocations are mostly coming out of 4K pages instead of higher order
> > ones.
> > 
> > Alternatively you could just make the suggestion I had proposed about
> > using spin_lock/unlock_irq in your worker thread and that resolved it
> > for me.
> > 
> > > >  I would prefer not to spend my time fixing and
> > > > tuning a patch set that I am still not convinced is viable.
> > > You  don't have to, I can fix the issues in my patch-set. :)
> > Sounds good. Hopefully the stuff I pointed out above helps you to get
> > a reproduction and resolve the issues.
> 
> So I did observe a significant drop in running my v12 path-set [1] with the
> suggested test setup. However, on making certain changes the performance
> improved significantly.
> 
> I used my v12 patch-set which I have posted earlier and made the following
> changes:
> 1. Started reporting only (MAX_ORDER - 1) pages and increased the number of
>     pages that can be reported at a time to 32 from 16. The intent of making
>     these changes was to bring my configuration closer to what Alexander is
>     using.

The increase from 16 to 32 is valid. No point in working in too small of
batches. However tightening the order to only test for MAX_ORDER - 1 seems
like a step in the wrong direction. The bitmap approach doesn't have much
value if it can only work with the highest order page. I realize it is
probably necessary in order to make the trick for checking on page_buddy
work, but it seems very limiting.

> 2. I made an additional change in my bitmap scanning logic to prevent acquiring
>     spinlock if the page is already allocated.

Again, not a fan. It basically means you can only work with MAX_ORDER - 1
and there will be no ability to work with anything smaller.

> 
> Setup:
> On a 16 vCPU 30 GB single NUMA guest affined to a single host NUMA, I ran the
> modified will-it-scale/page_fault number of times and calculated the average
> of the number of process and threads launched on the 16th core to compare the
> impact of my patch-set against an unmodified kernel.
> 
> 
> Conclusion:
> %Drop in number of processes launched on 16th vCPU =     1-2%
> %Drop in number of threads launched on 16th vCPU     =     5-6%

These numbers don't make that much sense to me. Are you talking about a
fully functioning setup that is madvsing away the memory in the
hypervisor? If so I would have expected a much higher difference versus
baseline as zeroing/faulting the pages in the host gets expensive fairly
quick. What is the host kernel you are running your test on? I'm just
wondering if there is some additional overhead currently limiting your
setup. My host kernel was just the same kernel I was running in the guest,
just built without the patches applied.

> Other observations:
> - I also tried running Alexander's latest v11 page-reporting patch set and
>   observe a similar amount of average degradation in the number of processes
>   and threads.
> - I didn't include the linear component recorded by will-it-scale because for
>   some reason it was fluctuating too much even when I was using an unmodified
>   kernel. If required I can investigate this further.
> 
> Note: If there is a better way to analyze the will-it-scale/page_fault results
> then please do let me know.

Honestly I have mostly just focused on the processes performance. There is
usually a fair bit of variability but a pattern forms after a few runs so
you can generally tell if a configuration is an improvement or not.

> Other setup details:
> Following are the configurations which I enabled to run my tests:
> - Enabled: CONFIG_SLAB_FREELIST_RANDOM & CONFIG_SHUFFLE_PAGE_ALLOCATOR
> - Set host THP to always
> - Set guest THP to madvise
> - Added the suggested madvise call in page_fault source code.
> @Alexander please let me know if I missed something.

This seems about right.

> The current state of my v13:
> I still have to look into Michal's suggestion of using page-isolation API's
> instead of isolating the page. However, I believe at this moment our objective
> is to decide with which approach we can proceed and that's why I decided to
> post the numbers by making small required changes in v12 instead of posting a
> new series.
> 
> 
> Following are the changes which I have made on top of my v12:
> 
> page_reporting.h change:
> -#define PAGE_REPORTING_MIN_ORDER               (MAX_ORDER - 2)
> -#define PAGE_REPORTING_MAX_PAGES               16
> +#define PAGE_REPORTING_MIN_ORDER              (MAX_ORDER - 1)
> +#define PAGE_REPORTING_MAX_PAGES              32
> 
> page_reporting.c change:
> @@ -101,8 +101,12 @@ static void scan_zone_bitmap(struct page_reporting_config
> *phconf,
>                 /* Process only if the page is still online */
>                 page = pfn_to_online_page((setbit << PAGE_REPORTING_MIN_ORDER) +
>                                           zone->base_pfn);
> -               if (!page)
> +               if (!page || !PageBuddy(page)) {
> +                       clear_bit(setbit, zone->bitmap);
> +                       atomic_dec(&zone->free_pages);
>                         continue;
> +               }
> 

I suspect the zone->free_pages is going to be expensive for you to deal
with. It is a global atomic value and is going to have the cacheline
bouncing that it is contained in. As a result thinks like setting the
bitmap with be more expensive as every tome a CPU increments free_pages it
will likely have to take the cache line containing the bitmap pointer as
well.

> @Alexander in case you decide to give it a try and find different results,
> please do let me know.
> 
> [1] https://lore.kernel.org/lkml/20190812131235.27244-1-nitesh@xxxxxxxxxx/
> 
> 

If I have some free time I will take a look. However one thing that
concerns me about this change is that it will limit things much further in
terms of how much memory can ultimately be freed since you are now only
working with the highest order page and that becomes a hard requirement
for your design.