Re: + mm-introduce-reported-pages.patch added to -mm tree

Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> · Wed, 6 Nov 2019 22:11:50 +0000

On Wed, Nov 06, 2019 at 09:48:14AM -0800, Alexander Duyck wrote:
> On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > Am 06.11.2019 um 13:16 schrieb Michal Hocko <mhocko@xxxxxxxxxx>:
> > > > > 
> > > > > ???I didn't have time to read through newer versions of this patch series
> > > > > but I remember there were concerns about this functionality being pulled
> > > > > into the page allocator previously both by me and Mel [1][2]. Have those been 
> > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > really a consensus that we want something like that living in the
> > > > > allocator?
> > > > 
> > > > I don???t think there is. The discussion is still ongoing (although quiet,
> > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > this.
> > > 
> > > How much time is needed to get a review? I waited 2 weeks since posting
> > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > hasn't changed much since v10 and that was posted back in mid September. I
> > > have been down to making small tweaks here and there and haven't had any
> > > real critiques on the approach since Mel had the comments about conflicts
> > > with compaction which I addressed by allowing compaction to punt the
> > > reporter out so that it could split and splice the lists as it walked
> > > through them.
> > 
> > Well, people are busy and MM community is not a large one. I cannot
> > really help you much other than keep poking those people and give
> > reasonable arguments so they decide to ack your patch.
> 
> I get that. But v10 was posted in mid September. Back then we had a
> discussion about addressing what Mel had mentioned and I had mentioned
> then that I had addressed it by allowing compaction to essentially reset
> the reporter to get it out of the list so compaction could do this split
> and splice tumbling logic.
> 

At the time I said "While in theory that could be addressed by always
going through an interface maintained by the page allocator, it would be
tricky to test the virtio case in particular."

Now, I get that you added an interface for that *but* if compaction was
ever updated or yet another approach was taken to deal with it, virtio
could get broken. If the page allocator itself has a bug or compaction
has a bug, the effect is very visible. While stuff does slip through, it
tends to be an obscure corner case. However, when virtio gets broken,
it'll be a long time before we get it.

Then we get onto the locking, the API appears to require the zone lock
be held which is not particularly obvious but it gets better. The zone
lock is a *heavy* lock now. By rights, it should be split into a
freelist and zone metadata lock as appropriate and then split the
freelist lock. Compaction could be updated and checked trivially, but
not so much virtio.

> > I definitely do not intent to nack this work, I just have maintainability
> > concerns and considering there is an alternative approach that does not
> > require to touch page allocator internals and which we need to compare
> > against then I do not really think there is any need to push something
> > in right away. Or is there any pressing reason to have this merged right
> > now?
> 
> The alternative approach doesn't touch the page allocator, however it
> still has essentially the same changes to __free_one_page. I suspect the
> performance issue seen is mostly due to the fact that because it doesn't
> touch the page allocator it is taking the zone lock and probing the page
> for each set bit to see if the page is still free. As such the performance
> regression seen gets worse the lower the order used for reporting.
> 

What confused me quite a lot is that this is enabled at compile time
and then incurs a performance hit whether there is a hypervisor that
even cares is involved or not. So I don't think the performance angle
justifying this approach is a good one because this implementation has
issues of its own. Previously I said

	I worry that poking too much into the internal state of the
	allocator will be fragile long-term. There is the arch alloc/free
	hooks but they are typically about protections only and does not
	interfere with the internal state of the allocator. Compaction
	pokes in as well but once the page is off the free list, the page
	allocator no longer cares so again there is on interference with
	the internal state. If the state is interefered with externally,
	it becomes unclear what happens if things like page merging is
	deferred in a way the allocator cannot control as high-order
	allocation requests may fail for example.

Adding an API for compaction does not get away from the problem that
it'll be fragile to depend on the internal state of the allocator for
correctness. Existing users that poke into the state do so as an
optimistic shortcut but if it fails, nothing actually breaks. The free
list reporting stuff might and will not be routinely tested.

Take compaction as an example, the initial implementation of it was dumb
as rocks and only started maintaining additional state and later poking
into the page allocator when there was empirical evidence it was necessary.

The initial implementation of page reporting should be the same, it
should do no harm at all to users that don't care (hiding behind
kconfig is not good enough, use static branches) and it should not
depend on the internal state of the page allocator and ordering of free
lists for correctness until it's shown it's absolutely necessary.

You say that the zone lock has to be taken in the alternative
implementation to check if it's still free and sure, that would cost
but unless you are isolating that page immediately then you are racing
once the lock is released. If you are isolating immediately, then isolate
pages in batches to amortise the loock costs.  The details of this could
be really hard but this approach is essentially saying "everything,
everywhere should take a small hit so the overhead is not noticeable for
virtio users" which is a curious choice for a new feature.

Regardless of the details of any implementation, the first one should be
basic, do no harm and be relatively simple giving just a bare interface
to virtio/qemu/etc. Then optimise it until such point as there is no
chance but to poke into the core.

-- 
Mel Gorman
SUSE Labs