-----Original Message-----
From: Martin Kletzander <mkletzan@xxxxxxxxxx>
Sent: Monday, August 17, 2020 4:58 PM
To: Zhong, Luyao <luyao.zhong@xxxxxxxxx>
Cc: libvir-list@xxxxxxxxxx; Zang, Rui <rui.zang@xxxxxxxxx>; Michal Privoznik <mprivozn@xxxxxxxxxx>
Subject: Re: [RFC PATCH] add a new 'default' option for attribute mode in numatune
On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>
>
>On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>
>>>
>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>> Hi Libvirt experts,
>>>>>
>>>>> I would like to enhance the numatune snippet configuration. Given
>>>>> an example snippet:
>>>>>
>>>>> <domain>
>>>>>   ...
>>>>>   <numatune>
>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>   </numatune>
>>>>>   ...
>>>>> </domain>
>>>>>
>>>>> Currently, the attribute mode is either 'interleave', 'strict', or
>>>>> 'preferred'; I propose to add a new 'default' option, for the
>>>>> following reason.
>>>>>
>>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems for all
>>>>> vcpu threads according to 'nodeset' in the memory element, and
>>>>> translates each memnode element into a qemu config option (--object
>>>>> memory-backend-ram) per numa cell, which ends up invoking the
>>>>> mbind() system call.[1]
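>>>>>
>>>>> Roughly, for the snippet above, libvirt ends up generating qemu
>>>>> options along these lines (the ids and sizes here are made up and
>>>>> purely illustrative):
>>>>>
>>>>>   -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind
>>>>>   -numa node,nodeid=0,memdev=ram-node0
>>>>>   -object memory-backend-ram,id=ram-node2,size=512M,host-nodes=2,policy=preferred
>>>>>   -numa node,nodeid=2,memdev=ram-node2
>>>>>
>>>>> and it is the host-nodes/policy pair that ends up as an mbind() call.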
>>>>>
>>>>> But what if we want to use the default memory policy and still have
>>>>> each guest numa cell pinned to different host memory nodes? We
>>>>> can't use mbind via qemu config options, because (I quote here)
>>>>> "For MPOL_DEFAULT, the nodemask and maxnode arguments must specify
>>>>> the empty set of nodes." [2]
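>>>>>
>>>>> A minimal sketch of that constraint (my own illustration based on
>>>>> the mbind(2) man page, error handling omitted):
>>>>>
>>>>>   #include <numaif.h>      /* mbind(), MPOL_*; link with -lnuma */
>>>>>   #include <sys/mman.h>
>>>>>
>>>>>   int main(void)
>>>>>   {
>>>>>       size_t len = 2UL * 1024 * 1024;
>>>>>       void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>>>>                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>>
>>>>>       /* what qemu does for mode='strict': pin the range to node 1 */
>>>>>       unsigned long node1 = 1UL << 1;
>>>>>       mbind(addr, len, MPOL_BIND, &node1, sizeof(node1) * 8, 0);
>>>>>
>>>>>       /* MPOL_DEFAULT only accepts an empty nodemask, so "default
>>>>>        * policy, but restricted to node 1" cannot be expressed */
>>>>>       mbind(addr, len, MPOL_DEFAULT, NULL, 0, 0);
>>>>>       return 0;
>>>>>   }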
>>>>>
>>>>> So my solution is to introduce a new 'default' option for the
>>>>> attribute mode, e.g.:
>>>>>
>>>>> <domain>
>>>>>   ...
>>>>>   <numatune>
>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>   </numatune>
>>>>>   ...
>>>>> </domain>
>>>>>
>>>>> If the mode is 'default', libvirt should avoid generating the qemu
>>>>> command line option '--object memory-backend-ram', and should
>>>>> instead use cgroups to set cpuset.mems per guest numa cell,
>>>>> combined with the numa topology config. Presume the numa topology
>>>>> is:
>>>>>
>>>>> <cpu>
>>>>>   ...
>>>>>   <numa>
>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>>>   </numa>
>>>>>   ...
>>>>> </cpu>
>>>>>
>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and to
>>>>> '2' for vcpus 4-7.
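>>>>>
>>>>> As an illustration only (<machine-scope> is a placeholder; the real
>>>>> cgroup v1 path depends on the machine name and the host's cgroup
>>>>> layout), that would amount to something like:
>>>>>
>>>>>   echo 1 > /sys/fs/cgroup/cpuset/<machine-scope>/vcpu0/cpuset.mems  # same for vcpu1-3
>>>>>   echo 2 > /sys/fs/cgroup/cpuset/<machine-scope>/vcpu4/cpuset.mems  # same for vcpu5-7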
>>>>>
>>>>>
>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>
>>>>
>>>> There are a couple of problems here. The memory is not (always)
>>>> allocated by the vCPU threads. I also remember it not being
>>>> allocated by the process, but in KVM, in a way that was not
>>>> affected by the cgroup settings.
>>>
>>> Thanks for your reply. Maybe I don't get what you mean; could you
>>> give me more context? But what I proposed will have no effect on
>>> other memory allocation.
>>>
>>
>> Check how cgroups work. We can set the memory nodes that a process
>> will allocate from. However, to set the nodes for a particular
>> process (thread), QEMU needs to be started with the vCPU threads
>> already spawned (albeit stopped), and by that point QEMU has already
>> allocated some memory. Moreover, even for memory allocated after we
>> set cpuset.mems, there is no guarantee that the allocation is done by
>> the vCPU thread of that NUMA cell; it might be done in the emulator
>> instead, or by the KVM module in the kernel, in which case it might
>> not be accounted to the process that actually caused the allocation
>> (as we've already seen with Linux). In all these cases cgroups will
>> not do what you want them to do. The last case might be fixed; the
>> first ones are by default not going to work.
>>
>>>> That might be fixed now, however.
>>>>
>>>> But basically what we are up against is all the reasons why we
>>>> started using QEMU's command line arguments for all that.
>>>>
>>> I'm not proposing to use QEMU's command line arguments; on the
>>> contrary, I want to use cgroups settings to support a new
>>> config/requirement. I'm giving a solution for the case where we
>>> require the default memory policy together with memory numa pinning.
>>>
>>
>> And I'm suggesting you look at the commit log to see why we *had* to
>> add these command line arguments, even though I think I managed to
>> describe most of the reasons above already (except for one that
>> _might_ already be fixed in the kernel). I understand the git log is
>> huge and the code around NUMA memory allocation was changing a lot,
>> so I hope my explanation will be enough.
>>
>Thank you for the detailed explanation, I think I get it now. We can't
>guarantee that memory allocation matches the requirement, since there
>is a time window before cpuset.mems is set.
>
That's one of the things, although this one could be avoided (by
setting a global cgroup before exec()).
>>> Thanks,
>>> Luyao
>>>> Sorry, but I think it will more likely break rather than fix stuff.
>>>> Maybe this could be dealt with by a switch in `qemu.conf` with a
>>>> huge warning above it.
>>>>
>>> I'm not trying to fix something; I'm proposing how to support a new
>>> requirement, just as I stated above.
>>>
>>
>> I guess we should take a couple of steps back, as I don't get what
>> you are trying to achieve. Maybe if you describe your use case it
>> will be easier to reach a conclusion.
>>
>Yeah, I do have a use case I didn't mention before. It's a kernel
>feature that is not merged yet; we call it memory tiering
>(https://lwn.net/Articles/802544/).
>
>If memory tiering is enabled on the host, DRAM is the top tier memory
>and PMEM (persistent memory) is the second tier memory; PMEM shows up
>as a numa node without cpus. In short, pages can be migrated between
>DRAM and PMEM based on DRAM pressure and on how cold/hot they are.
>
>We could configure multiple memory migration paths. For example, with
>node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make
>0+2 one group and 1+3 another group. Within each group, pages are
>allowed to migrate down (demotion) and up (promotion).
>
>If **we want our VMs to utilize memory tiering and have a NUMA
>topology**, we need to handle how guest memory maps to host memory,
>which means we need to bind each guest numa node to a group of memory
>nodes (DRAM node + PMEM node) on the host. For example, guest node 0
>-> host nodes 0+2.
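>
>(To make that concrete, a hypothetical snippet following the node
>numbers above; this is the kind of binding I would like the proposed
>'default' mode to express:
>
>  <numatune>
>    <memory mode="default" nodeset="0-3"/>
>    <memnode cellid="0" mode="default" nodeset="0,2"/>
>    <memnode cellid="1" mode="default" nodeset="1,3"/>
>  </numatune>
>
>so the vcpus of guest node 0 would get cpuset.mems set to "0,2" and
>those of guest node 1 would get "1,3".)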
>
>However, only the cgroups setting can make memory tiering work; if we
>use the mbind() system call, demoted pages will never go back to DRAM.
>That's why I propose to add the 'default' option and bypass mbind in
>QEMU.
>
>I hope that is understandable. I'd appreciate it if you could give
>some suggestions.
>
This comes around every couple of months/years and bites us in the back
no matter which way we go (every time there is someone who wants it the
other way).

That's why I think there could be a way for the user to specify whether
they will likely move the memory or not, and based on that we would
either specify `host-nodes` and `policy` to qemu or not. I think I even
suggested this before (or probably delegated it to someone else for a
suggestion so that there would be more discussion), but nobody really
replied.
So what we need, I think, is a way for someone to set per-domain
information on whether we should bind the memory to nodes in a
changeable fashion or not. I'd like to have it in as well. The way we
need to do that is probably per-domain, because adding yet another
switch for each place in the XML where we can select a NUMA memory
binding would be suicide. There should also be no need for this to be
enabled per memory (module, node), so it should work fine.
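
Just to illustrate (the attribute name here is made up, not a proposal
for the final XML), such a per-domain switch could look roughly like
this:

  <numatune>
    <memory mode="strict" nodeset="0-3" binding="movable"/>
    <memnode cellid="0" mode="strict" nodeset="0,2"/>
    <memnode cellid="1" mode="strict" nodeset="1,3"/>
  </numatune>

where binding="movable" would mean libvirt relies on cgroups only and
leaves out the `host-nodes`/`policy` properties on the qemu command
line, so the kernel remains free to migrate the pages later.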