Re: [PATCH RFC 04/12] kernel/user: Allow user::locked_vm to be usable for iommufd

On Mon, Mar 28, 2022 at 02:14:26PM +0100, Sean Mooney wrote:
> On Mon, 2022-03-28 at 09:53 +0800, Jason Wang wrote:
> > On Thu, Mar 24, 2022 at 7:46 PM Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> > > 
> > > On Thu, Mar 24, 2022 at 11:50:47AM +0800, Jason Wang wrote:
> > > 
> > > > It's simply because we don't want to break existing userspace. [1]
> > > 
> > > I'm still waiting to hear what exactly breaks in real systems.
> > > 
> > > As I explained, this is not a significant change, but it could break
> > > something in a few special scenarios.
> > > 
> > > Also the one place we do have ABI breaks is security, and ulimit is a
> > > security mechanism that isn't working right. So we do clearly need to
> > > understand *exactly* what real thing breaks - if anything.
> > > 
> > > Jason
> > > 
> > 
> > To tell the truth, I don't know. I remember that Openstack may do some
> > accounting, so adding Sean for more comments. But we really can't imagine
> > openstack is the only userspace that may use this.
> Sorry, there is a lot of context to this discussion. I have tried to read back
> the thread but I may have missed part of it.

Thanks Sean, this is quite interesting, though I'm not sure it
entirely answers the question.

> tl;dr: openstack does not currently track locked/pinned memory per
> user or per VM because we have no idea when libvirt will request it
> or how much is needed per device. When ulimits are configured today
> for nova/openstack it's done at the qemu user level outside of
> openstack in our installer tooling, e.g. in tripleo the ulimits
> would be set on the nova_libvirt container to constrain all VMs
> spawned, not per VM/process.

So, today, you expect the ulimit to be machine-wide: if your machine
has 1 TB of memory you'd set the ulimit at 0.9 TB, and you'd like
everything underneath to limit memory pinning to 0.9 TB globally
across all qemus?

To be clear, it doesn't work that way today at all; you might as well
not bother setting the ulimit to anything less than unlimited at the
openstack layer.

>    hard_limit
>    
>        The optional hard_limit element is the maximum memory the
>    guest can use. The units for this value are kibibytes
>    (i.e. blocks of 1024 bytes). Users of QEMU and KVM are strongly
>    advised not to set this limit as domain may get killed by the
>    kernel if the guess is too low, and determining the memory needed
>    for a process to run is an undecidable problem; that said, if you
>    already set locked in memory backing because your workload
>    demands it, you'll have to take into account the specifics of
>    your deployment and figure out a value for hard_limit that is
>    large enough to support the memory requirements of your guest,
>    but small enough to protect your host against a malicious guest
>    locking all memory.

And hard_limit is the ulimit that Alex was talking about?

So now we switched from talking about global per-user things to
per-qemu-instance things?
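
For illustration, a per-process cap at the syscall level looks like
the sketch below. I'm not claiming this is exactly what libvirt does
for hard_limit; it is just the per-qemu-instance flavour of the limit,
with a made-up pid argument and a made-up 1 GiB value:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/resource.h>

    /* Cap the memlock rlimit of one target process (e.g. a single
     * qemu instance). The 1 GiB value is purely illustrative. */
    int main(int argc, char **argv)
    {
        struct rlimit rl = {
            .rlim_cur = 1ULL << 30,
            .rlim_max = 1ULL << 30,
        };
        pid_t pid;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        pid = atoi(argv[1]);

        if (prlimit(pid, RLIMIT_MEMLOCK, &rl, NULL)) {
            perror("prlimit");
            return 1;
        }
        return 0;
    }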

> We could not figure out how to automatically compute a hard_limit in
> nova that would work for everyone, and we felt exposing this to our
> users/operators was a bit of a cop out when they likely can't
> calculate that properly either.

Not surprising.

> As a result we can't actually account for them
> today when scheduling workloads to a host. I'm not sure this would
> change even if you exposed new user space APIs unless we had a way
> to inspect each VF to know how much locked memory that VF would need
> to lock?

We are not talking about a new uAPI; we are talking about changing the
meaning of the existing ulimit. You can see it in your message above:
at the openstack level you were talking about global limits, and then
at the libvirt level you are talking about per-qemu limits.

In the kernel both of these end up using the same control, and one of
the users is wrong.

The kernel consensus is that the ulimit is per-user and is used by all
kernel entities consistently.

Currently vfio is different and uses it per-process and effectively
has its own private bucket.

When you talk about VDPA you start to see the problems here, because
VDPA uses a different accounting from VFIO. If you run VFIO and VDPA
together then you should need 2x the ulimit, but today you only need
1x because they don't share accounting buckets.

This also means the ulimit doesn't actually work the way it is
supposed to.
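
To spell that out with a toy model (plain userspace C, the struct and
field names are invented for illustration, none of this is actual
kernel code): with one shared per-user bucket the second charge fails
once the limit is consumed, while with private per-subsystem buckets
both charges succeed and twice the limit ends up pinned:

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy model of the two accounting schemes. */
    struct toy_user {
        unsigned long locked_vm;   /* shared per-user bucket */
    };

    struct toy_subsys {
        unsigned long private_vm;  /* private per-subsystem bucket */
    };

    static bool charge(unsigned long *bucket, unsigned long npages,
                       unsigned long limit)
    {
        if (*bucket + npages > limit)
            return false;          /* would exceed the rlimit */
        *bucket += npages;
        return true;
    }

    int main(void)
    {
        const unsigned long limit = 1000;   /* "ulimit" in pages */
        struct toy_user user = { 0 };
        struct toy_subsys vfio = { 0 }, vdpa = { 0 };
        bool a, b;

        /* Shared bucket: the first pin of 800 pages fits, the second
         * would take the user to 1600 and is refused. */
        a = charge(&user.locked_vm, 800, limit);
        b = charge(&user.locked_vm, 800, limit);
        printf("shared:  first=%d second=%d\n", a, b);

        /* Private buckets: both fit, 1600 pages end up pinned even
         * though the limit is 1000. */
        a = charge(&vfio.private_vm, 800, limit);
        b = charge(&vdpa.private_vm, 800, limit);
        printf("private: vfio=%d vdpa=%d\n", a, b);
        return 0;
    }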

The question is how to fix it and, if we do fix it, how much of what
is out there cares that things work differently.

Jason


