-----Original Message-----
From: Martin Kletzander <mkletzan@xxxxxxxxxx>
Sent: Monday, August 17, 2020 4:58 PM
To: Zhong, Luyao <luyao.zhong@xxxxxxxxx>
Cc: libvir-list@xxxxxxxxxx; Zang, Rui <rui.zang@xxxxxxxxx>; Michal Privoznik <mprivozn@xxxxxxxxxx>
Subject: Re: [RFC PATCH] add a new 'default' option for attribute mode in numatune
On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
On 8/7/2020 4:24 PM, Martin Kletzander wrote:
On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
On 8/3/2020 7:00 PM, Martin Kletzander wrote:
On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
Hi Libvirt experts,
I would like to enhance the numatune snippet configuration. Given an
example snippet:
<domain>
  ...
  <numatune>
    <memory mode="strict" nodeset="1-4,^3"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="2" mode="preferred" nodeset="2"/>
  </numatune>
  ...
</domain>
Currently, the mode attribute is either 'interleave', 'strict', or
'preferred'. I propose to add a new 'default' option; I give the
reason as follows.
Presuming we are using cgroups v1, libvirt sets cpuset.mems for all
vCPU threads according to the 'nodeset' in the memory element, and
translates each memnode element into a QEMU config option (--object
memory-backend-ram) per NUMA cell, which ends up invoking the mbind()
system call. [1]
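For illustration, the memnode elements above end up as something like
this on the QEMU command line (the ids and sizes here are made up,
only the host-nodes=/policy= part matters):

    -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind \
    -numa node,nodeid=0,memdev=ram-node0 \
    -object memory-backend-ram,id=ram-node2,size=512M,host-nodes=2,policy=preferred \
    -numa node,nodeid=2,memdev=ram-node2

and it is the host-nodes=/policy= pair that makes QEMU call mbind()
on the backing memory.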
But what if we want to use the default memory policy and still have
each guest NUMA cell pinned to different host memory nodes? We can't
use mbind() via QEMU config options, because (I quote here) "For
MPOL_DEFAULT, the nodemask and maxnode arguments must specify the
empty set of nodes." [2]
So my solution is to introduce a new 'default' option for the mode
attribute, e.g.:
<domain>
  ...
  <numatune>
    <memory mode="default" nodeset="1-2"/>
    <memnode cellid="0" mode="default" nodeset="1"/>
    <memnode cellid="1" mode="default" nodeset="2"/>
  </numatune>
  ...
</domain>
If the mode is 'default', libvirt should avoid generating the QEMU
command line option '--object memory-backend-ram', and instead use
cgroups to set cpuset.mems per guest NUMA cell, combined with the
NUMA topology config. Presume the NUMA topology is:
<cpu>
  ...
  <numa>
    <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
    <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
  </numa>
  ...
</cpu>
Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2'
for vcpus 4-7.
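Roughly speaking, that would amount to something like the following
(a sketch only; $CG stands for the per-domain cpuset controller
directory that libvirt creates, e.g. somewhere under
/sys/fs/cgroup/cpuset/machine.slice/ with cgroups v1):

    # pin the vCPU threads of guest cell 0 to host node 1,
    # and those of guest cell 1 to host node 2
    for v in 0 1 2 3; do echo 1 > "$CG/vcpu$v/cpuset.mems"; done
    for v in 4 5 6 7; do echo 2 > "$CG/vcpu$v/cpuset.mems"; done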
Is this reasonable and feasible? Any comments are welcome.
There are a couple of problems here. The memory is not (always)
allocated by the vCPU threads. I also remember it not being allocated
by the process at all, but by KVM, in a way that was not affected by
the cgroup settings.
Thanks for your reply. Maybe I don't get what you mean; could you
give me more context? But what I proposed will have no effect on
other memory allocations.
Check how cgroups work. We can set the memory nodes that a process
will allocate from. However, to set the node for the process
(thread), QEMU needs to be started with the vCPU threads already
spawned (albeit stopped), and for that QEMU already allocates some
memory. Moreover, if extra memory is allocated after we set
cpuset.mems, it is not guaranteed that it will be allocated by the
vCPU in that NUMA cell; it might be done by the emulator instead, or
by the KVM module in the kernel, in which case it might not be
accounted to the process actually causing the allocation (as we've
already seen with Linux). In all these cases cgroups will not do what
you want them to do. The last case might be fixed; the first ones are
by default not going to work.
That might be fixed now, however.
But basically what we are up against is all the reasons why we
started using QEMU's command line arguments for all that.
I'm not proposing to use QEMU's command line arguments; on the
contrary, I want to use cgroups settings to support a new
config/requirement. I am giving a solution for the case where we
require the default memory policy together with per-NUMA-cell memory
pinning.
And I'm suggesting you look at the commit log to see why we *had* to
add these command line arguments, even though I think I managed to
describe most of them above already (except for one that _might_
already be fixed in the kernel). I understand the git log is huge
and the code around NUMA memory allocation was changing a lot, so I
hope my explanation will be enough.
Thank you for the detailed explanation, I think I get it now. We
can't guarantee that memory allocation matches the requirement, since
there is a window of time before cpuset.mems is set.
That's one of the things, although this one could be avoided (by
setting a global cgroup before exec()).
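Very roughly, the idea is something like this (a sketch; paths and
nodesets are only illustrative, not what libvirt actually does
today):

    # create a cpuset cgroup, restrict it, move ourselves into it,
    # and only then exec QEMU so even the earliest allocations are constrained
    mkdir /sys/fs/cgroup/cpuset/qemu-pre
    echo 0-7 > /sys/fs/cgroup/cpuset/qemu-pre/cpuset.cpus
    echo 1-2 > /sys/fs/cgroup/cpuset/qemu-pre/cpuset.mems
    echo $$  > /sys/fs/cgroup/cpuset/qemu-pre/tasks
    exec qemu-system-x86_64 ...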
Thanks,
Luyao
Sorry, but I think it will more likely break stuff rather than fix
it. Maybe this could be dealt with by a switch in `qemu.conf` with a
huge warning above it.
I'm not trying to fix something; I am proposing how to support a new
requirement, just like I stated above.
I guess we should take a couple of steps back; I don't get what you
are trying to achieve. Maybe if you describe your use case it will
be easier to reach a conclusion.
Yeah, I do have a use case I didn't mention before. It's a kernel
feature that is not merged yet; we call it memory tiering
(https://lwn.net/Articles/802544/).
If memory tiering is enabled on the host, DRAM is the top-tier memory
and PMEM (persistent memory) is the second-tier memory; PMEM shows up
as a NUMA node without CPUs. In short, pages can be migrated between
DRAM and PMEM based on DRAM pressure and how cold/hot they are.
We can configure multiple memory migration paths. For example, with
node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make
0+2 one group and 1+3 another group. Within each group, pages are
allowed to migrate down (demotion) and up (promotion).
If **we want our VMs to utilize memory tiering and still have a NUMA
topology**, we need to handle the mapping of guest memory to host
memory; that means we need to bind each guest NUMA node to a group of
memory nodes (DRAM node + PMEM node) on the host. For example, guest
node 0 -> host nodes 0+2.
However, only the cgroups setting can make memory tiering work; if we
use the mbind() system call, demoted pages will never go back to
DRAM. That's why I propose to add the 'default' option and bypass
mbind() in QEMU.
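To make that concrete with the grouping above (again only a sketch;
$CG is the per-domain cpuset cgroup, and I reuse the 0-3/4-7 vCPU
split from the earlier topology example):

    # let the kernel demote/promote freely within each DRAM+PMEM group
    for v in 0 1 2 3; do echo 0,2 > "$CG/vcpu$v/cpuset.mems"; done   # guest node 0
    for v in 4 5 6 7; do echo 1,3 > "$CG/vcpu$v/cpuset.mems"; done   # guest node 1

With mbind() to a single node that freedom is lost, which is why
demoted pages never come back.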
I hope I have made myself understandable. I'd appreciate it if you
could give some suggestions.
This comes around every couple of months/years and bites us in the
back no matter which way we go (every time there is someone who wants
it the other way).
That's why I think there could be a way for the user to specify
whether they are likely to move the memory or not, and based on that
we would pass `host-nodes` and `policy` to qemu or not. I think I
even suggested this before (or probably delegated it to someone else
for a suggestion so that there is more discussion), but nobody really
replied.
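For example (ids and sizes made up), a "pinned" setting would keep
generating what we do today:

    -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind

while a "movable" setting would drop host-nodes=/policy= and leave
the placement purely to cgroups:

    -object memory-backend-ram,id=ram-node0,size=512M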
So what we need, I think, is a way for someone to set per-domain
information on whether we should bind the memory to nodes in a
changeable fashion or not. I'd like to have it in as well. The way we
need to do that is, probably, per-domain, because adding yet another
switch for each place in the XML where we can select a NUMA memory
binding would be suicide. There should also be no need for this to be
enabled per memory module or node, so it should work fine.