>-----Original Message-----
>From: Martin Kletzander <mkletzan@xxxxxxxxxx>
>Sent: Wednesday, March 31, 2021 5:37 PM
>To: Zhong, Luyao <luyao.zhong@xxxxxxxxx>
>Cc: Daniel P. Berrangé <berrange@xxxxxxxxxx>; libvir-list@xxxxxxxxxx
>Subject: Re: [PATCH v4 0/3] introduce 'restrictive' mode in numatune
>
>On Wed, Mar 31, 2021 at 06:33:28AM +0000, Zhong, Luyao wrote:
>>
>>>-----Original Message-----
>>>From: Martin Kletzander <mkletzan@xxxxxxxxxx>
>>>Sent: Wednesday, March 31, 2021 12:21 AM
>>>To: Zhong, Luyao <luyao.zhong@xxxxxxxxx>
>>>Cc: Daniel P. Berrangé <berrange@xxxxxxxxxx>; libvir-list@xxxxxxxxxx
>>>Subject: Re: [PATCH v4 0/3] introduce 'restrictive' mode in numatune
>>>
>>>On Tue, Mar 30, 2021 at 08:53:19AM +0000, Zhong, Luyao wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Martin Kletzander <mkletzan@xxxxxxxxxx>
>>>>> Sent: Thursday, March 25, 2021 10:28 PM
>>>>> To: Daniel P. Berrangé <berrange@xxxxxxxxxx>
>>>>> Cc: Zhong, Luyao <luyao.zhong@xxxxxxxxx>; libvir-list@xxxxxxxxxx
>>>>> Subject: Re: [PATCH v4 0/3] introduce 'restrictive' mode in numatune
>>>>>
>>>>> On Thu, Mar 25, 2021 at 02:14:47PM +0000, Daniel P. Berrangé wrote:
>>>>> >On Thu, Mar 25, 2021 at 03:10:56PM +0100, Martin Kletzander wrote:
>>>>> >> On Thu, Mar 25, 2021 at 09:11:02AM +0000, Zhong, Luyao wrote:
>>>>> >> >
>>>>> >> > > -----Original Message-----
>>>>> >> > > From: Martin Kletzander <mkletzan@xxxxxxxxxx>
>>>>> >> > > Sent: Thursday, March 25, 2021 4:46 AM
>>>>> >> > > To: Daniel P. Berrangé <berrange@xxxxxxxxxx>
>>>>> >> > > Cc: Zhong, Luyao <luyao.zhong@xxxxxxxxx>; libvir-list@xxxxxxxxxx
>>>>> >> > > Subject: Re: [PATCH v4 0/3] introduce 'restrictive' mode in numatune
>>>>> >> > >
>>>>> >> > > On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>>>>> >> > > >On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>>>> >> > > >> Before this patch set, numatune only has three memory modes:
>>>>> >> > > >> strict, interleave and preferred. These memory policies are
>>>>> >> > > >> ultimately set by the mbind() system call.
>>>>> >> > > >>
>>>>> >> > > >> A memory policy could be 'hard coded' into the kernel, but
>>>>> >> > > >> none of the above policies fit our requirement in that case.
>>>>> >> > > >> mbind() supports the default memory policy, but it requires a
>>>>> >> > > >> NULL nodemask, so setting the allowed memory nodes is
>>>>> >> > > >> obviously cgroups' mission in that case. So we introduce a
>>>>> >> > > >> new option for mode in numatune named 'restrictive'.
>>>>> >> > > >>
>>>>> >> > > >> <numatune>
>>>>> >> > > >>   <memory mode="restrictive" nodeset="1-4,^3"/>
>>>>> >> > > >>   <memnode cellid="0" mode="restrictive" nodeset="1"/>
>>>>> >> > > >>   <memnode cellid="2" mode="restrictive" nodeset="2"/>
>>>>> >> > > >> </numatune>
>>>>> >> > > >
>>>>> >> > > >'restrictive' is rather a weird name and doesn't really
>>>>> >> > > >tell me what the memory policy is going to be. As far as I
>>>>> >> > > >can tell from the patches, it seems this causes us to not
>>>>> >> > > >set any memory allocation policy at all. IOW, we're using
>>>>> >> > > >some undefined host default policy.
>>>>> >> > > >
>>>>> >> > > >Given this I think we should be calling it either "none" or "default"
>>>>> >> > > >
>>>>> >> > >
>>>>> >> > > I was against "default" because having such an option available,
>>>>> >> > > but the actual default being different, sounds stupid.
>>>>> >> > > Similarly "none" sounds like no restrictions are applied, or
>>>>> >> > > that it is the same as if nothing was specified.
>>>>> >> > > It is funny to imagine the situation when I am explaining to
>>>>> >> > > someone how to achieve this solution:
>>>>> >> > >
>>>>> >> > > "The default is 'strict', you need to explicitly set it to 'default'."
>>>>> >> > >
>>>>> >> > > or
>>>>> >> > >
>>>>> >> > > "What setting did you use?"
>>>>> >> > > "None"
>>>>> >> > > "As in no mode or in mode='none'?"
>>>>> >> > >
>>>>> >> > > As I said before, please come up with any name, but not
>>>>> >> > > these, which are IMHO actually more confusing.
>>>>> >> > >
>>>>> >> >
>>>>> >> > Hi Daniel and Martin, thanks for your reply. As Martin said,
>>>>> >> > the current default mode is "strict", so "default" was ruled
>>>>> >> > out at the beginning when I proposed this change. And since we
>>>>> >> > actually have cgroups restricting the memory resource, could
>>>>> >> > we call this a "none" mode? I still don't have a better
>>>>> >> > name. ☹
>>>>> >> >
>>>>> >>
>>>>> >> Me neither, as figuring out names when ours do not precisely map
>>>>> >> to anything else (since we are using multiple solutions to get
>>>>> >> as close to the desired result as possible) is difficult,
>>>>> >> because there is no similar pre-existing setting. And using
>>>>> >> anything like "cgroups-only" would probably limit us in the
>>>>> >> future.
>>>>> >
>>>>> >What I'm still really missing in this series is a clear statement
>>>>> >of what the problem with the current modes is, and what this new
>>>>> >mode provides to solve it. The documentation for the new XML
>>>>> >attribute is not clear on this and neither are the commit
>>>>> >messages. There's a pointer to an enormous mailing list thread,
>>>>> >but reading through 50 messages is not a viable way to learn the
>>>>> >answer.
>>>>> >
>>>>> >I'm not even certain that we should be introducing a new mode
>>>>> >value at all, as opposed to a separate attribute.
>>>>> >
>>>>>
>>>>> Yes, Luyao, could you summarize the reason for the new mode?
>>>>> I think that the difference in behaviour between using cgroups plus
>>>>> memory binding, as opposed to just using cgroups, should be enough
>>>>> for others to be able to figure out when to use this mode and when
>>>>> not.
>>>>>
>>>>Sure.
>>>>Let me give a concrete use case first. There is a new feature in the
>>>>kernel, not merged yet, called memory tiering
>>>>(https://lwn.net/Articles/802544/). If memory tiering is enabled on a
>>>>host, DRAM is the top-tier memory and PMEM (persistent memory) is the
>>>>second-tier memory; a PMEM node shows up as a NUMA node without CPUs.
>>>>Pages can be migrated between a DRAM node and a PMEM node based on
>>>>DRAM pressure and how cold/hot they are. *This memory policy* is
>>>>implemented in the kernel, so we need a default mode here; but from
>>>>libvirt's perspective the "default" mode is "strict", which is not
>>>>the MPOL_DEFAULT
>>>>(https://man7.org/linux/man-pages/man2/mbind.2.html) defined in the
>>>>kernel.
>>>>Besides, to make memory tiering work well, the cgroups setting is
>>>>necessary, since it restricts pages to being migrated only between
>>>>the DRAM and PMEM nodes that we specified (NUMA affinity support).
>>>>
>>>>Apart from the above use case, we might have some scenarios that only
>>>>require the cgroups restriction.
>>>>That's why the "restrictive" mode is proposed.
>>>>
>>>>In a word, if a user requires the default mode (MPOL_DEFAULT) and
>>>>requires cgroups to restrict memory allocation, "restrictive" mode
>>>>will be useful.
>>>>
>>>
>>>Yeah, I also seem to recall something about the fact that just using
>>>cgroups with multiple nodes in the nodeset lets the kernel decide
>>>which node (out of those in the restricted set) to allocate on, but
>>>specifying "strict" basically allocates sequentially (on the first
>>>node until it is full, then on the next one and so on). I do not have
>>>anything to back this up, so do you remember if that was the case as
>>>well, or does my memory serve me poorly?
>>>
>>Yeah, exactly.
>>😊
>>
>>cpuset.mems just specifies the list of memory nodes on which the
>>processes are allowed to allocate memory:
>>https://man7.org/linux/man-pages/man7/cpuset.7.html
>>
>>This link gives a detailed introduction to the "strict" mode:
>>https://man7.org/linux/man-pages/man2/mbind.2.html
>>
>
>So the behaviour I remembered was the case before Linux 2.6.26, not any
>more. But anyway there are still some more differences:
>

Not only before 2.6.26: it still allocates sequentially after 2.6.26;
the change is just from "based on node id" to "based on distance", I
think.

>- The default setting uses the system default memory policy, which is
>  the same as 'bind' most of the time. It is closer to 'interleave'
>  during system boot (which does not concern us), but the fact that it
>  is the same as 'bind' might change in the future (as Luyao said).
>
>- If we change the memory policy (which is what happens with 'strict'),
>  then we cannot change it later on, as only the threads themselves can
>  change their nodemask (or the policy). AFAIK QEMU does not provide an
>  API for this, nor should it have the permissions to do it.
>  We, however, can do that if we just use cgroups. And 'virsh numatune'
>  already provides that for the whole domain (we just don't have an API
>  to do that per memory node).
>
>These should definitely be noted in the documentation and, ideally,
>hinted at in the commit message as well. I just do not know how to do
>that nicely without just pointing to the libnuma man pages.
>

Yes, the current doc is not clear enough. I'll try my best to explain
the new mode in a later patch update.

@Daniel P. Berrangé, do you still have concerns about what this mode is
for, and do you have any suggestions about naming this mode?

>Thoughts?
>
>>>>BR,
>>>>Luyao
>>>>
>>>>> >Regards,
>>>>> >Daniel
>>>>> >--
>>>>> >|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
>>>>> >|: https://libvirt.org -o- https://fstop138.berrange.com :|
>>>>> >|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|