Re: [PATCH v3 0/6] mm / virtio: Provide support for unused page reporting

Alexander Duyck <alexander.h.duyck@xxxxxxxxxxxxxxx> · Fri, 02 Aug 2019 16:15:23 -0700

On Fri, 2019-08-02 at 10:28 -0700, Alexander Duyck wrote:
> On Fri, 2019-08-02 at 12:19 -0400, Nitesh Narayan Lal wrote:
> > On 8/2/19 11:13 AM, Alexander Duyck wrote:
> > > On Fri, 2019-08-02 at 10:41 -0400, Nitesh Narayan Lal wrote:
> > > > On 8/1/19 6:24 PM, Alexander Duyck wrote:
> > > > > 

<snip>

> > > > > One side effect of these patches is that the guest becomes much more
> > > > > resilient in terms of NUMA locality. With the pages being freed and then
> > > > > reallocated when used it allows for the pages to be much closer to the
> > > > > active thread, and as a result there can be situations where this patch
> > > > > set will out-perform the stock kernel when the guest memory is not local
> > > > > to the guest vCPUs.
> > > > Was this the reason because of which you were seeing better results for
> > > > page_fault1 earlier?
> > > Yes I am thinking so. What I have found is that in the case where the
> > > patches are not applied on the guest it takes a few runs for the numbers
> > > to stabilize. What I think was going on is that I was running memhog to
> > > initially fill the guest and that was placing all the pages on one node or
> > > the other and as such was causing additional variability as the pages were
> > > slowly being migrated over to the other node to rebalance the workload.
> > > One way I tested it was by trying the unpatched case with a direct-
> > > assigned device since that forces it to pin the memory. In that case I was
> > > getting bad results consistently as all the memory was forced to come from
> > > one node during the pre-allocation process.
> > > 
> > 
> > I have also seen that the page_fault1 values take some time to get stabilize on
> > an unmodified kernel.
> > What I am wondering here is that if on a single NUMA guest doing the following
> > will give the right/better idea or not:
> > 
> > 1. Pin the guest to a single NUMA node.
> > 2. Run memhog so that it touches all the guest memory.
> > 3. Run will-it-scale/page_fault1.
> > 
> > Compare/observe the values for the last core (this is considering the other core
> > values doesn't drastically differ).
> 
> I'll rerun the test with qemu affinitized to one specific socket. It will
> cut the core/thread count down to 8/16 on my test system. Also I will try
> with THP and page shuffling enabled.

Okay so results with 8/16 all affinitized to one socket, THP enabled
page_fault1, and shuffling enabled:

With page reporting disabled in the hypervisor there wasn't much
difference. I saw a range of 0.69% to -1.35% versus baseline, and an
average of 0.16% improvement. So effectively no change.

With page reporting enabled I saw a range of -2.10% to -4.50%, with an
average of -3.05% regression. This is much closer to what I would expect
for this patch set as the page faulting, double zeroing (once in host, and
once in guest), and hinting process itself should have some overhead.