Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers

Ying Huang <ying.huang@xxxxxxxxx> · Wed, 08 Jun 2022 09:24:25 +0800

On Mon, 2022-06-06 at 14:32 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 2:22 PM, Ying Huang wrote:
> ....
> > > > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > > > explicit memory tiers"). With that, there will not be any regression in
> > > > > between the patch series.
> > > > > 
> > > > 
> > > > Thanks!  Please do that.  And I think you can add sysfs interface after
> > > > that patch too.  That is, in [1/7]
> > > > 
> > > 
> > > I am not sure why you insist on moving sysfs interfaces later. They are
> > > introduced based on the helper added. It make patch review easier to
> > > look at both the helpers and the user of the helper together in a patch.
> > 
> > Yes.  We should introduce a function and its user in one patch for
> > review.  But this doesn't mean that we should introduce the user space
> > interface as the first step.  I think the user space interface should
> > output correct information when we expose it.
> > 
> 
> If you look at this patchset we are not exposing any wrong information.
> 
> patch 1 -> adds ability to register the memory tiers and expose details 
> of registered memory tier. At this point the patchset only support DRAM 
> tier and hence only one tier is shown

But inside kernel, we actually work with 2 tiers and demote/prmote pages
between them.  With the information from your interface, users would
think that there is no any demotion/promotion in kernel because there's
only 1 tier.

> patch 2 -> adds per node memtier attribute. So only DRAM nodes shows the 
> details, because the patchset yet has not introduced a slower memory 
> tier like PMEM.
> 
> patch 4 -> introducing demotion. Will make that patch 5
> 
> patch 5 -> add dax kmem numa nodes as slower memory tier. Now this 
> becomes patch 4 at which point we will correctly show two memory tiers 
> in the system.
> 
> 
> > > > +struct memory_tier {
> > > > +	nodemask_t nodelist;
> > > > +};
> > > > 
> > > > And struct device can be added after the kernel has switched the
> > > > implementation based on explicit memory tiers.
> > > > 
> > > > +struct memory_tier {
> > > > +	struct device dev;
> > > > +	nodemask_t nodelist;
> > > > +};
> > > > 
> > > 
> > > 
> > > Can you elaborate on this? or possibly review the v5 series indicating
> > > what change you are suggesting here?
> > > 
> > > 
> > > > But I don't think it's a good idea to have "struct device" embedded in
> > > > "struct memory_tier".  We don't have "struct device" embedded in "struct
> > > > pgdata_list"...
> > > > 
> > > 
> > > I avoided creating an array for memory_tier (memory_tier[]) so that we
> > > can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
> > > the life cycle management of that dynamic list. We free the struct
> > > memory_tier allocation via device release function (memtier->dev.release
> > > = memory_tier_device_release )
> > > 
> > > Why do you think it is not a good idea?
> > 
> > I think that we shouldn't bind our kernel internal implementation with
> > user space interface too much.  Yes.  We can expose kernel internal
> > implementation to user space in a direct way.  I suggest you to follow
> > the style of "struct pglist_data" and "struct node".  If we decouple
> > "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> > we can refer to "struct memory_tier" without depending on all device
> > core.  Memory tier should be accessible inside the kernel even without a
> > user interface.  And memory tier isn't a device in concept.
> > 
> 
> memory_tiers are different from pglist_data and struct node in that we 
> also allow the creation of them from userspace.

I don't think that there's much difference.  struct pglist_data and
struct node can be created/destroyed dynamically too.  Please take a
look at

  __try_online_node()
  register_one_node()
  try_offline_node()
  unregister_one_node()

> That is the life time of 
> a memory tier is driven from userspace and it is much easier to manage 
> them via sysfs file lifetime mechanism rather than inventing an 
> independent and more complex way of doing the same.

You needs to manage the lifetime of struct memory_tier in kernel too. 
Because there are kernel users.  And even if you use device core
lifetime mechanism, you don't need to embed struct device in struct
memory_tier too, you can free "separate" struct memory_tier in "release"
callback of struct device.

> > For life cycle management, I think that we can do that without sysfs
> > too.
> > 
> 
> unless there are specific details that you think will be broken by 
> embedding struct device inside struct memory_tier, IMHO I still consider 
> the embedded implementation much simpler and in accordance with other 
> kernel design patterns.

In concept, struct memory_tier isn't a device.  Although we expose it as
a device in sysfs.  That's just an implementation detail.  So I think
it's better to make struct memory_tier independent of struct device if
possible.

Via not embeding struct device in struct memory_tier, it's much easier
to dereference struct memory_tier directly in inline function in ".h". 
We don't need to introduce one accessor function for each field of
struct memory_tier for that.

Best Regards,
Huang, Ying