Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> · Thu, 18 Apr 2019 09:24:35 -0700

On 4/17/19 10:51 AM, Michal Hocko wrote:
On Wed 17-04-19 10:26:05, Yang Shi wrote:
On 4/17/19 9:39 AM, Michal Hocko wrote:
On Wed 17-04-19 09:37:39, Keith Busch wrote:
On Wed, Apr 17, 2019 at 05:39:23PM +0200, Michal Hocko wrote:
On Wed 17-04-19 09:23:46, Keith Busch wrote:
On Wed, Apr 17, 2019 at 11:23:18AM +0200, Michal Hocko wrote:
On Tue 16-04-19 14:22:33, Dave Hansen wrote:
Keith Busch had a set of patches to let you specify the demotion order
via sysfs for fun.  The rules we came up with were:
I am not a fan of any sysfs "fun"
I'm hung up on the user facing interface, but there should be some way a
user decides if a memory node is or is not a migrate target, right?
Why? Or to put it differently, why do we have to start with a user
interface at this stage when we actually barely have any real usecases
out there?
The use case is an alternative to swap, right? The user has to decide
which storage is the swap target, so operating in the same spirit.
I do not follow. If you use rebalancing you can still deplete the memory
and end up in a swap storage. If you want to reclaim/swap rather than
rebalance then you do not enable rebalancing (by node_reclaim or similar
mechanism).
I'm a little bit confused. Do you mean just do *not* do reclaim/swap in
rebalancing mode? If rebalancing is on, then node_reclaim just moves the
pages around nodes, and kswapd or direct reclaim would take care of swap?
Yes, that was the idea I wanted to get through. Sorry if that was not
really clear.

If so, node reclaim on a PMEM node may rebalance the pages to a DRAM node?
Should this be allowed?
Why shouldn't it? If there are other vacant nodes to absorb that memory
then why not use them?

I think both Keith and I intended to treat PMEM as a tier in the reclaim
hierarchy. The reclaim should push inactive pages down to PMEM, then swap.
So PMEM is kind of a "terminal" node. That is why he introduced a
sysfs-defined target node and I introduced N_CPU_MEM.
I understand that. And I am trying to figure out whether we really have
to treat PMEM specially here. Why is it any better than generic NUMA
rebalancing code that could be used for many other usecases which are
not PMEM specific? If you present PMEM as regular memory then also use
it as normal memory.

This also makes some sense. We just look at PMEM from different points of
view. In this patchset, taking the performance disparity into account may
outweigh treating it as normal memory.

A ridiculous idea: may we have two modes? One for "rebalancing", the
other for "demotion"?
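
To make the difference between the two modes concrete, here is a rough
userspace sketch, not kernel code and not taken from either patchset. All
names (pick_target, MODE_REBALANCE, MODE_DEMOTION, NO_TARGET, struct node)
are hypothetical. It assumes "demotion" only moves pages from DRAM down to
PMEM and treats PMEM as terminal, while "rebalancing" lets any node with
free pages absorb them regardless of type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model for discussion only; none of these names exist in the kernel. */
enum node_kind { NODE_DRAM, NODE_PMEM };
enum reclaim_mode { MODE_REBALANCE, MODE_DEMOTION };

struct node {
	enum node_kind kind;
	unsigned long free_pages;
};

#define NO_TARGET (-1)	/* caller falls back to regular reclaim/swap */

/* Return the index of the node that should absorb pages from nodes[src]. */
int pick_target(enum reclaim_mode mode, const struct node *nodes,
		int nr_nodes, int src)
{
	assert(nodes != NULL && src >= 0 && src < nr_nodes);

	/* Demotion: PMEM is the terminal tier, so its pages can only be swapped. */
	if (mode == MODE_DEMOTION && nodes[src].kind == NODE_PMEM)
		return NO_TARGET;

	for (int i = 0; i < nr_nodes; i++) {
		if (i == src || nodes[i].free_pages == 0)
			continue;
		/* Demotion: only move down the hierarchy, i.e. DRAM -> PMEM. */
		if (mode == MODE_DEMOTION && nodes[i].kind != NODE_PMEM)
			continue;
		/* Rebalancing: any vacant node qualifies, DRAM or PMEM. */
		return i;
	}
	return NO_TARGET;
}
```

With one full DRAM node, one DRAM node with free pages and one PMEM node,
demotion from the full DRAM node picks the PMEM node and demotion from PMEM
picks nothing, whereas rebalancing from PMEM is allowed to pick the vacant
DRAM node.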