Re: + mm-introduce-reported-pages.patch added to -mm tree

Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> · Fri, 8 Nov 2019 09:43:40 +0000

On Thu, Nov 07, 2019 at 08:07:02AM -0800, Alexander Duyck wrote:
> > And what I'm saying is that the initial version should have focused
> > exclusively on the mechanism to report free pages and kept the search as
> > simple as possible with optimisations on top. By all means track pages
> > that are already reported and skip them because that is a relatively basic
> > operation with the caveat that even the page_reported checks should be
> > behind a static branch if there is no virtio balloon driver loaded.
> 
> I can split out the list manipulation logic and put it behind a static
> branch, however I cannot do that with page_reported. The issue is that
> with keeping any state in the page we need to clear it when the page is
> allocated. I could do the static branch approach if we were to clear the
> bit for all pages, but I thought it better to just do a test and clear in
> order to avoid an unnecessary read-modify-write.
> 

A test and clear (the non-atomic variety) might be fine but I don't see
why the page_reported or zone flag test test cannot be behind a static
branch. If no virtio balloon driver with page reporting exists then it
does not matter if we track the state. If it's loaded for the first time,
no page has been updated but that's also ok, it now gets to start. Maybe
you are thinking about what happens if the driver is unloaded? If that
is your concern, do not disable the static branch once it is enabled. I
don't think we need to worry about handling driver unloading efficiently.

> > The page reporting to a hypervisor is done out of band from any application
> > so even if it takes a little longer to do that reporting, is there an
> > impact really? If so, how much and what frequency is this impact incurred?
> 
> My main concern has been time with the zone lock held in order to extract
> the page from the list. Any time while we are holding the zone lock is
> time when another CPU cannot free or allocate a page from that zone.
> Without the pointers I have to walk the free lists until I find a non-
> reported page in the zone. That is the overhead I am avoiding by
> maintaining the pointers since we have to walk the free lists for every
> batch of pages we want to pull.
> 

I'm not too worried about the zone lock hold time to be honest as long
as it's not long enough to cause soft lockups. The very nature of the
feature is that there is a performance penalty to favour packing more
VMs into a physical host. Even if the zone lock hold wasn't a factor,
the reallocation of pages after they have been freed by the hypervisor
will be a slowdown. Anyone using this feature has already accepted a
performance hit and this is why I'm not massively concerned about the
performance of the reporting itself.

> > The primary impact I can see is that free pages are invisible while the
> > notification takes place so the window for that may be wider if it takes
> > longer to find enough pages to send in batch but there will always be a
> > window where the free pages are invisible.
> 
> You are correct. There is a window where the pages will be invisible. The
> hope is that the impact of that should be fairly small since we only pull
> a few pages out at a time. In addition we will not pull past the watermark
> since the pages are pulled via __isolate_free_page.
> 

Indeed

> > > > <SNIP>
> > > > 
> > > > What confused me quite a lot is that this is enabled at compile time
> > > > and then incurs a performance hit whether there is a hypervisor that
> > > > even cares is involved or not. So I don't think the performance angle
> > > > justifying this approach is a good one because this implementation has
> > > > issues of its own. Previously I said
> > > 
> > > We are adding code. There is always going to be some performance impact.
> > 
> > And that impact should be negligible in so far as possible, parts of
> > it are protected by branches but as it stands, just building the virtio
> > balloon driver enables all of this whether the running system is using
> > virtual machines or not.
> 
> Agreed the code is present, however that would be the case with a static
> branch as well. I tried to use static branches where I thought it made
> sense and to just use the page_reported test for the cases where I
> couldn't. I can probably add the static branch to a few more spots however
> I am not sure it adds much value when it is already behind a page_reported
> check.
> 

It's simply an ideal from my perspective. Other features like cpusets
have zero impact if they are not in use.

> > > > Adding an API for compaction does not get away from the problem that
> > > > it'll be fragile to depend on the internal state of the allocator for
> > > > correctness. Existing users that poke into the state do so as an
> > > > optimistic shortcut but if it fails, nothing actually breaks. The free
> > > > list reporting stuff might and will not be routinely tested.
> > > 
> > > I view what I am doing as not being too different from that. I am only
> > > maintaining state when I can. I understand that it makes things more
> > > fragile as we have more routes where things could go wrong, but isn't that
> > > the case with adding any interface. I have simplified this about as far as
> > > it can go.
> > > 
> > 
> > The enhanced tracking of state is not necessary initially to simply report
> > free pages to the hypervisor. 
> 
> Define enhanced tracking?

Primarily the list manipulations and boundary pointers.

> If you are talking about the PageReported bit
> then I would disagree. I need to have some way to indicate that the free
> page has been reported or not if I am going to make forward progress while
> performing the reporting in iterations.

I'm not concerned about the PageReported bit. I fully accept that we
should not be isolating and re-reporting pages that have already been
handled by the hypervisor.

> > And again, it's not clear that the additional complexity is required.
> > Specifically, it'll be very hard to tell if the state tracking is
> > actually helping because excessive list shuffling due to compaction may
> > mean that a lot of state is being tracked while the full free list ends
> > up having to be searched anyway.
> > 
> > As it stands, the optimisations are hard-wired into the providing of
> > the feature itself making it an all or nothing approach and no option to
> > "merge a bare-bones mechanism that is behind static branches and build
> > optimisations on top". At least that way, any competing approach could
> > be based on optimisations alone while the feature is still available.
> 
> If needed I can split another patch out of patches 3 and 4 that does the
> list manipulation. It shouldn't take all that long to do as I have already
> done it once for testing to get the 20% regression number.

>From your perspective, I see it's a bit annoying because in the final
result, the code should be identical. However, it'll be a lot clearer
during review what is required, what level of complexity optimisations
add and the performance of it. The changelog should include what metric
you are using to evaluate the performance, the test case and the delta. It
also will be easier from a debugging perspective as minimally a bisection
could identify if a bug was due to the core mechanism itself or one of
the optimisations.  Finally, it leaves open the possibility that someone
can evaluate a completely different set of optimisations. Whatever the
alternative approaches are, the actual interface to virtio ballon surely
is the same (I don't actually know, I just can't see why the virtio ABI
would be depend on how the pages are isolated, tracked and reported).

> Also as I
> mentioned I can put the list manpiulation behind a static key, but cannot
> do so for the actual testing and clearing of the page reporting bit.
> 

Still not 100% sure why because until the driver is loaded, the bits
don't matter.

-- 
Mel Gorman
SUSE Labs