Re: [RFC PATCH] add a new 'default' option for attribute mode in numatune

"Zhong, Luyao" <luyao.zhong@xxxxxxxxx> · Sat, 7 Nov 2020 10:41:52 +0800

On 11/4/2020 9:02 PM, Martin Kletzander wrote:
On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
On 10/16/2020 9:32 PM, Zang, Rui wrote:

How about if “migratable” is set, “mode” should be ignored/omitted? 
So any setting of “mode” will be rejected with an error indicating an 
invalid configuration.
We can say in the doc that “migratable” and “mode” shall not be set 
together. So even the default value of “mode” is not taken.

If "mode" is not set, it is the same as setting it to "strict"
('strict' is the default value). This comes down to a code detail: the
mode is translated to an enumerated type whose value is 0 both when the
mode is not set and when it is set to 'strict'. That code follows a
fixed skeleton, so it is not easy to modify.

Well I see it as it is "strict". It does not mean "strict cgroup setting",
because cgroups are just one of the ways to enforce this.  Look at it 
this way:

mode can be:
  - strict: only these nodes can be used for the memory
  - preferred: these nodes should be preferred, but allocation should
    not fail
  - interleave: interleave the memory between these nodes

Due to the naming this maps to cgroup settings 1:1.

But now we have another way of enforcing this, using qemu cmdline
option.  The names actually map 1:1 to those as well:

https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901

So my idea was that we would add a movable/migratable/whatever attribute
that would tell us which way for enforcing we use because there does not
seem to be "one size fits all" solution.  Am I misunderstanding this
discussion?  Please correct me if I am.  Thank you.

Actually I need support for the default memory policy (the memory policy
that is 'hard coded' into the kernel). I thought "migratable" was enough
to indicate that we rely on the operating system to operate the memory
policy, so when "migratable" is set, "mode" should not be set. But when
I was coding, I found that the default value of "mode" is "strict"; it
is always "strict" even if "migratable" is yes, which means we would
configure two different memory policies at the same time. So I still
need a new option for "mode" that does not conflict with "migratable",
and if we have that new option ("default") for "mode", it seems we can
drop "migratable".

Besides, we can make "mode" a "one size fits all" solution: just reject
any different "mode" value in a memnode element when "mode" is "default"
in the memory element.
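
To make the rejection rule above concrete, here is a minimal Python
sketch (not libvirt code). It assumes the proposed 'default' mode value,
which does not exist in libvirt today, and the function name
check_numatune is hypothetical. Treating a memnode without a mode
attribute as acceptable is also an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical: "default" is the mode value proposed in this thread,
# not an existing libvirt numatune mode.
PROPOSED_DEFAULT = "default"


def check_numatune(numatune_xml):
    """Raise ValueError if a <memnode> mode conflicts with a 'default' <memory> mode."""
    root = ET.fromstring(numatune_xml)
    memory = root.find("memory")
    # A missing <memory> element or mode attribute behaves like 'strict'.
    mode = "strict" if memory is None else memory.get("mode", "strict")
    if mode != PROPOSED_DEFAULT:
        return  # Nothing to enforce for the existing modes in this sketch.
    for memnode in root.findall("memnode"):
        node_mode = memnode.get("mode")
        # Assumption: a memnode without an explicit mode is accepted.
        if node_mode is not None and node_mode != PROPOSED_DEFAULT:
            raise ValueError(
                f"memnode cellid={memnode.get('cellid')}: mode '{node_mode}' "
                f"conflicts with memory mode '{PROPOSED_DEFAULT}'"
            )


# Accepted: every memnode agrees with the 'default' memory mode.
check_numatune(
    '<numatune>'
    '<memory mode="default" nodeset="0-3"/>'
    '<memnode cellid="0" mode="default" nodeset="0,2"/>'
    '</numatune>'
)
```

A config whose memory element says mode="default" while a memnode says
mode="strict" would raise ValueError here, instead of being documented
as invalid only.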

I summarized it in a new email:
https://www.redhat.com/archives/libvir-list/2020-November/msg00084.html

Sorry I didn't make it easy to understand.

Regards,
Luyao
So I need an option to indicate "I don't specify any mode.".

On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong@xxxxxxxxx> wrote:

Hi Martin, Peter and other experts,

We reached a consensus earlier that we need to introduce a new
"migratable" attribute. But during implementation, I found that
introducing a new 'default' option for the existing mode attribute is
still necessary.

I have an initial patch for 'migratable' and Peter gave some comments
already.
https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html

The current issue is that if I set 'migratable', any 'mode' should be
ignored. Peter commented that I can't rely on the docs to tell users
that some config is invalid; I need to reject the config in the code,
and I completely agree with that. But the default value of 'mode' is
'strict', so it will always conflict with 'migratable'. In the end I
still need to introduce a new option for 'mode' that is a legal config
when 'migratable' is set.

If we have 'default' option, is 'migratable' still needed then?

FYI.
The 'mode' corresponds to the memory policy, and there is already a
notion of a default memory policy.
  quote:
    System Default Policy:  this policy is "hard coded" into the kernel.
(https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
So it might be easier to understand if we introduce a 'default' option
directly.

Regards,
Luyao

On 8/26/2020 6:20 AM, Martin Kletzander wrote:
On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:

On 8/19/2020 11:24 PM, Martin Kletzander wrote:
On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:

-----Original Message-----
From: Martin Kletzander <mkletzan@xxxxxxxxxx>
Sent: Monday, August 17, 2020 4:58 PM
To: Zhong, Luyao <luyao.zhong@xxxxxxxxx>
Cc: libvir-list@xxxxxxxxxx; Zang, Rui <rui.zang@xxxxxxxxx>;
    Michal Privoznik <mprivozn@xxxxxxxxxx>
Subject: Re: [RFC PATCH] add a new 'default' option for attribute mode
    in numatune

On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:

On 8/7/2020 4:24 PM, Martin Kletzander wrote:
On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:

On 8/3/2020 7:00 PM, Martin Kletzander wrote:
On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
Hi Libvirt experts,

I would like to enhance the numatune snippet configuration. Given an
example snippet:

Currently, the attribute mode is either 'interleave', 'strict', or
'preferred'. I propose to add a new 'default' option. My reasons are as
follows.

Presume we are using cgroups v1. Libvirt sets cpuset.mems for all vcpu
threads according to the 'nodeset' in the memory element, and translates
the memnode element to qemu config options (--object memory-backend-ram)
for each numa cell, which invokes the mbind() system call in the
end.[1]

But what if we want to use the default memory policy and request each
guest numa cell to be pinned to different host memory nodes? We can't
use mbind via qemu config options, because (I quote here) "For
MPOL_DEFAULT, the nodemask and maxnode arguments must be specify the
empty set of nodes." [2]

So my solution is to introduce a new 'default' option for the attribute
mode, e.g.

If the mode is 'default', libvirt should avoid generating the qemu
command line '--object memory-backend-ram', and should instead use
cgroups to set cpuset.mems for each guest numa cell, in combination with
the numa topology config. Presume the numa topology is:

<cpu>
  ...
  <numa>
    <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
    <cell id='1' cpus='4-7' memory='512000' unit='KiB' />
  </numa>
  ...
</cpu>

Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2' for
vcpus 4-7.

Is this reasonable and feasible? Welcome any comments.

There are a couple of problems here.  The memory is not (always)
allocated by the vCPU threads.  I also remember it to not be allocated
by the process, but in KVM in a way that was not affected by the cgroup
settings.

Thanks for your reply. Maybe I don't get what you mean, could you give
me more context? But what I proposed will have no effect on other memory
allocation.

Check how cgroups work.  We can set the memory nodes that a process will
allocate from.  However to set the node for the process (thread) QEMU
needs to be started with the vCPU threads already spawned (albeit
stopped).  And for that QEMU already allocates some memory.  Moreover if
extra memory was allocated after we set the cpuset.mems it is not
guaranteed that it will be allocated by the vCPU in that NUMA cell, it
might be done in the emulator instead or the KVM module in the kernel in
which case it might not be accounted for the process actually causing
the allocation (as we've already seen with Linux).  In all these cases
cgroups will not do what you want them to do.  The last case might be
fixed, the first ones are by default not going to work.

That might be fixed now, however.

But basically what we have against is all the reasons why we
started using QEMU's command line arguments for all that.

I'm not proposing to use QEMU's command line arguments; on the contrary,
I want to use cgroups settings to support a new config/requirement. I
gave a solution for the case where we require both the default memory
policy and memory numa pinning.

And I'm suggesting you look at the commit log to see why we *had* to add
these command line arguments, even though I think I managed to describe
most of them above already (except for one that _might_ already be fixed
in the kernel).  I understand the git log is huge and the code around
NUMA memory allocation was changing a lot, so I hope my explanation will
be enough.

Thank you for the detailed explanation, I think I get it now. We can't
guarantee that memory allocation matches the requirement, since there is
a time window before cpuset.mems is set.

That's one of the things, although this one could be avoided (by setting
a global cgroup before exec()).

Thanks,
Luyao
Sorry, but I think it will more likely break rather than fix stuff.
Maybe this could be dealt with by a switch in `qemu.conf` with a huge
warning above it.

I'm not trying to fix something, I propose how to support a new
requirement just like I stated above.

I guess we should take a couple of steps back, I don't get what you are
trying to achieve.  Maybe if you describe your use case it will be
easier to reach a conclusion.

Yeah, I do have a use case I didn't mention before. It's a kernel
feature that is not merged yet; we call it memory tiering.
(https://lwn.net/Articles/802544/)

If memory tiering is enabled on the host, DRAM is the top-tier memory
and PMEM (persistent memory) is the second-tier memory; PMEM is shown as
a numa node without cpus. In short, pages can be migrated between DRAM
and PMEM based on DRAM pressure and how cold/hot they are.

We could configure multiple memory migration paths. For example, with
node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make 0+2
one group and 1+3 another group. Within each group, pages are allowed to
migrate down (demotion) and up (promotion).

If **we want our VMs to utilize memory tiering and to have a NUMA
topology**, we need to handle the mapping of guest memory to host
memory, which means we need to bind each guest numa node to a group of
memory nodes (DRAM node + PMEM node) on the host. For example, guest
node 0 -> host node 0+2.
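
As a rough illustration of that mapping, the following Python sketch
turns guest-node-to-host-group bindings into the list strings that would
be written to cpuset.mems. The function name and the idea of one
cpuset.mems value per guest node are assumptions made for illustration
only; this is not how libvirt currently lays out its cgroups.

```python
def cpuset_mems_for_guest_nodes(guest_to_host):
    """Return {guest node: cpuset.mems list string} for each tiering group.

    guest_to_host maps a guest NUMA node id to the set of host nodes
    (DRAM node + PMEM node) that form its tiering group.
    """
    return {
        guest: ",".join(str(node) for node in sorted(hosts))
        for guest, hosts in guest_to_host.items()
    }


# Host nodes 0 and 1 are DRAM, 2 and 3 are PMEM; groups are 0+2 and 1+3.
groups = {0: {0, 2}, 1: {1, 3}}
print(cpuset_mems_for_guest_nodes(groups))  # {0: '0,2', 1: '1,3'}
```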

However, only the cgroups setting can make memory tiering work; if we
use the mbind() system call, demoted pages will never go back to DRAM.
That's why I propose to add the 'default' option and bypass mbind in
QEMU.

I hope I have made myself understood. I would appreciate any
suggestions you could give.

This comes around every couple of months/years and bites us in the back
no matter what way we go (every time there is someone who wants it the
other way).
That's why I think there could be a way for the user to specify whether
they will likely move the memory or not and based on that we would
specify `host-nodes` and `policy` to qemu or not.  I think I even
suggested this before (or probably delegated it to someone else for a
suggestion so that there is more discussion), but nobody really replied.

So what we need, I think, is a way for someone to set a per-domain
information whether we should bind the memory to nodes in a changeable
fashion or not.
I'd like to have it in as well.  The way we need to do that is,
probably, per-domain, because adding yet another switch for each place
in the XML where we can select a NUMA memory binding would be a suicide.
There should also be no need for this to be enabled per memory-(module,
node), so it should work fine.

Thanks for letting us know your vision about this.
From what I understood, the "changeable fashion" means that the guest
numa cell binding can be changed out of band after initial binding,
either by system admin or the operating system (memory tiering in our
case), or whatever the third party is.  Is that perception correct?

Yes.  If the user wants to have the possibility of changing the binding,
then we use *only* cgroups.  Otherwise we use the qemu parameters that
will make qemu call mbind() (as that has other pros mentioned above).
The other option would be extra communication between QEMU and libvirt
during start to let us know when to set what cgroups etc., but I don't
think that's worth it.

It seems to me mbind() or set_mempolicy() system calls do not offer that
flexibility of changing afterwards. So in case of QEMU/KVM, I can only
think of cgroups.
So to be specific, if we had this additional "memory_binding_changeable"
option specified, we will try to do the guest numa constraining via
cgroups whenever possible. There will probably also be conflicts in
options or things that cgroups can not do. For such cases we'd fail the
domain.

Basically we'll do what we're doing now and skip the qemu `host-nodes`
and `policy` parameters with the new option.  And of course we can fail
with a nice error message if someone wants to move the memory without
the option selected and so on.
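
A small Python sketch of that idea, purely illustrative: the function
name, the `movable` flag and the `ram-node<N>` id are assumptions, and
the strict-to-bind mapping reflects the QEMU policy names as I
understand them. Only `host-nodes` and `policy` are taken from the
discussion above.

```python
# Assumed mapping from numatune mode names to QEMU host memory policies.
QEMU_POLICY = {
    "strict": "bind",
    "preferred": "preferred",
    "interleave": "interleave",
}


def memory_backend_props(cell_id, size_bytes, host_nodes, mode, movable):
    """Build the properties for one guest cell's memory-backend-ram object.

    When `movable` is set, `host-nodes` and `policy` are left out so that
    QEMU does not call mbind() and placement is left to cgroups alone.
    """
    props = {
        "qom-type": "memory-backend-ram",
        "id": f"ram-node{cell_id}",  # illustrative id
        "size": size_bytes,
    }
    if not movable:
        props["host-nodes"] = sorted(host_nodes)
        props["policy"] = QEMU_POLICY[mode]
    return props


print(memory_backend_props(0, 512000 * 1024, {0, 2}, "strict", movable=True))
print(memory_backend_props(0, 512000 * 1024, {0, 2}, "strict", movable=False))
```

With movable=True the backend carries no binding at all, which is what
would allow the memory to be moved later by changing cpuset.mems.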

Thanks for your comments.

I'd like to be clearer about how to define the interface in the domain
xml, then I can go further into the implementation.

As you mentioned, a per-domain option will be better than per-node. I
went through the libvirt domain format to look for a proper position to
place this option. Then I'm thinking we could still utilize the numatune
element to configure it.

<numatune>
   <memory mode="strict" nodeset="1-4,^3"/>
   <memnode cellid="0" mode="strict" nodeset="1"/>
   <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>

Coincidentally, the optional memory element specifies how to allocate
memory for the domain process on a NUMA host. So can we utilize this
element and introduce a new mode like "changeable" or whatever? Do you
have a better name?

Yeah, I was thinking something along the lines of:
<numatune>
    <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no" />
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>
If the memory mode is set to 'changeable', we could ignore the mode
setting for each memnode, and then we only configure by cgroups. I have
not dived into the code yet, but I expect it could work.

Yes, the example above gives the impression of the attribute being
available per-node.  But that could be handled in the documentation.
Specifying it per-node seems very weird, why would you want the memory
to be hard-locked, but for some guest nodes only?
Thanks,
Luyao

If you agree with the direction, I think we can dig deeper to see what
will come out.

Regards,
Zang, Rui

Ideally we'd discuss it with others, but I think I am only one of a few
people who dealt with issues in this regard.  Maybe Michal (Cc'd) also
dealt with some things related to the binding, so maybe he can chime in.

regards,
Luyao

Have a nice day,
Martin

Regards,
Luyao

[1]https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379

[2]https://man7.org/linux/man-pages/man2/mbind.2.html

--
2.25.1