Re: [Qemu-devel] [RFC PATCH] Exporting Guest RAM information for NUMA binding

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Thu, 22 Dec 2011 09:01:08 -0200

On Thu, Dec 01, 2011 at 06:40:31PM +0100, Peter Zijlstra wrote:
> On Wed, 2011-11-23 at 16:03 +0100, Andrea Arcangeli wrote:
> > Hi!
> > 
> > On Mon, Nov 21, 2011 at 07:51:21PM -0600, Anthony Liguori wrote:
> > > Fundamentally, the entity that should be deciding what memory should be present 
> > > and where it should be located is the kernel.  I'm fundamentally opposed to trying 
> > > to make QEMU override the scheduler/mm by using cpu or memory pinning in QEMU.
> > > 
> > >  From what I can tell about ms_mbind(), it just uses process knowledge to bind 
> > > specific areas of memory to a memsched group and lets the kernel decide what to 
> > > do with that knowledge.  This is exactly the type of interface that QEMU should 
> > > be using.
> > > 
> > > QEMU should tell the kernel enough information such that the kernel can make 
> > > good decisions.  QEMU should not be the one making the decisions.
> > 
> > True, QEMU won't have to decide where the memory and vcpus should be
> > located (but hey it wouldn't need to decide that even if you use
> > cpusets, you can use relative mbind with cpusets, the admin or a
> > cpuset job scheduler could decide) but it's still QEMU making the
> > decision of what memory and which vcpus threads to
> > ms_mbind/ms_tbind. Think how you're going to create the input of those
> > syscalls...
> > 
> > If it wasn't qemu to decide that, qemu wouldn't be required to scan
> > the whole host physical numa (cpu/memory) topology in order to create
> > the "input" arguments of "ms_mbind/ms_tbind".
> 
> That's a plain falsehood, you don't need to scan host physical topology
> in order to create useful ms_[mt]bind arguments. You can use physical
> topology to optimize for particular hardware, but it's not a strict
> requirement.
> 
> >  And when you migrate the
> > VM to another host, the whole vtopology may be counter-productive
> > because the kernel isn't automatically detecting the numa affinity
> > between threads and the guest vtopology will stick to whatever numa
> > _physical_ topology that was seen on the first node where the VM was
> > created.
> 
> This doesn't make any sense at all.
> 
> > I doubt that the assumption that all cloud nodes will have the same
> > physical numa topology is reasonable.
> 
> So what? If you want to be very careful you can make sure your vnodes are
> small enough they fit any physical node in your cloud (god I f*king
> hate that word).
> 
> If you're slightly less careful, things will still work, you might get
> less max parallelism, but typically (from what I understood) these VM
> hosting thingies are overloaded so you never get your max cpu anyway, so
> who cares.
> 
> Thing is, whatever you set up, it will always work. It might not be
> optimal, but the one guarantee, that [threads,vrange] will stay on the
> same node, will be kept true no matter where you run it.
> 
> Also, migration between non-identical hosts is always 'tricky'. You're
> always stuck with some minimally supported subset or average case thing.
> Really, why do you think NUMA would be any different.
> 
> > Furthermore to get the same benefits that qemu gets on host by using
> > ms_mbind/ms_tbind, every single guest application should be modified
> > to scan the guest vtopology and call ms_mbind/ms_tbind too (or use the
> > hard bindings which is what we try to avoid).
> 
> No! ms_[tm]bind() is just part of the solution, the other part is what
> to do for simple programs, and like I wrote in my email earlier, and
> what we talked about in Prague, is that for normal simple proglets we
> simply pick a numa node and stick to it. Much like:
> 
>  http://home.arcor.de/efocht/sched/
> 
> Except we could actually migrate the whole thing if needed. Basically
> you give each task its own 1 vnode and assign all threads to it.
> 
> Only big programs that need to span multiple nodes need to be modified
> to get best advantage of numa. But that has always been true.
> 
> > In my view the trouble of the numa hard bindings is not the fact
> > they're hard and qemu has to also decide the location (in fact it
> > doesn't need to decide the location if you use cpusets and relative
> > mbinds). The bigger problem is the fact either the admin or the app
> > developer has to explicitly scan the numa physical topology (both cpus
> > and memory) and tell the kernel how much memory to bind to each
> > thread. ms_mbind/ms_tbind only partially solve that problem. They're
> > similar to the mbind MPOL_F_RELATIVE_NODES with cpusets, except you
> > don't need an admin or a cpuset-job-scheduler (or a perl script) to
> > redistribute the hardware resources.
> 
> You're full of crap Andrea. 
> 
> Yes you need some clue as to your actual topology, but that's life, you
> can't get SMP for free either, you need to have some clue.
> 
> Just like with regular SMP where you need to be aware of data sharing,
> NUMA just makes it worse. If your app decomposes well enough to create a
> vnode per thread, that's excellent, if you want to scale your app to fit
> your machine that's fine too, heck, every multi-threaded app out there
> worth using already queries machine topology one way or another, it's not
> a big deal.
> 
> But cpusets and relative_nodes don't work, you still get your memory
> splattered all over whatever nodes you allow and the scheduler will
> still move your task around based purely on cpu-load. 0-win.
> 
> Not needing a (userspace) job-scheduler is a win, because that avoids
> having everybody talk to this job-scheduler, and there's multiple
> job-schedulers out there, two can't properly co-exist, etc. Also, the
> kernel is the right place to do this.
> 
> [ this btw is true for all muddle-ware solutions, try and fit two
> applications together that are written against different but similar
> purpose muddle-wares and shit will come apart quickly ]
> 
> > Now dealing with bindings isn't big deal for qemu, in fact this API is
> > pretty much ideal for qemu, but it won't make life substantially
> > easier than if compared to hard bindings. Simply the management code
> > that is now done with a perl script will have to be moved in the
> > kernel. It looks an incremental improvement compared to the relative
> > mbind+cpuset, but I'm unsure if it's the best we could aim for and
> > what we really need in virt considering we deal with VM migration too.
> 
> No, virt is crap, it needs to die, it's horrid, and any solution aimed
> squarely at virt only is shit and not worth considering, that simple.

Taking this phrase out of context (feel free to object to the following
question on that basis): what are your concerns with virtualization
itself? Is it only that having an unknowable operating system under
your feet is uncomfortable, or is there something else? Because virt
is green, it saves silicon.
