On 13.04.22 19:52, Jason Gunthorpe wrote:
> On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
>> On 12.04.22 16:36, Jason Gunthorpe wrote:
>>> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>>>
>>>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>>>> past already with secretmem, it's not 100% that good of a fit (unmovable
>>>> is worse than mlocked). But it gets the job done for now at least.
>>>
>>> No, it doesn't. There are too many different interpretations of how
>>> MEMLOCK is supposed to work.
>>>
>>> e.g. VFIO accounts per-process, so hostile users can just fork to go
>>> past it.
>>>
>>> RDMA is per-process but uses a different counter, so you can double up.
>>>
>>> iouring is per-user and uses a 3rd counter, so it can triple up on
>>> the above two.
>>
>> Thanks for that summary, very helpful.
>
> I kicked off a big discussion when I suggested to change vfio to use
> the same as io_uring
>
> We may still end up trying it, but the major concern is that libvirt
> sets the RLIMIT_MEMLOCK and if we touch anything here - including
> fixing RDMA, or anything really, it becomes a uAPI break for libvirt..

Okay, so we have to introduce a second mechanism, avoid RLIMIT_MEMLOCK
for new unmovable memory, and then eventually phase out RLIMIT_MEMLOCK
usage for existing unmovable memory consumers (which, as you say, will
be difficult).

>>>> So I'm open for alternatives to limit the amount of unmovable memory we
>>>> might allocate for user space, and then we could convert secretmem as
>>>> well.
>>>
>>> I think it has to be cgroup based considering where we are now :\
>>
>> Most probably. I think the important lessons we learned are that
>>
>> * mlocked != unmovable.
>> * RLIMIT_MEMLOCK should most probably never have been abused for
>>   unmovable memory (especially, long-term pinning)
>
> The trouble is I'm not sure how anything can correctly/meaningfully
> set a limit.
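
Right. As a toy model (not kernel code; all names and numbers are made
up for illustration), the three divergent accounting schemes you list
above look roughly like this - each subsystem checks its own counter
against the same rlimit, so one user can triple up, and forking resets
the per-process charges on top of that:

```python
# Toy model of the divergent RLIMIT_MEMLOCK accounting schemes:
# VFIO per-process, RDMA per-process on a separate counter,
# io_uring per-user on a third counter.
from collections import defaultdict

RLIMIT_MEMLOCK = 64  # pages; the one limit all three paths consult

vfio_pinned = defaultdict(int)     # keyed by pid
rdma_pinned = defaultdict(int)     # also keyed by pid, separate counter
iouring_pinned = defaultdict(int)  # keyed by uid

def vfio_pin(pid, pages):
    if vfio_pinned[pid] + pages > RLIMIT_MEMLOCK:
        return False
    vfio_pinned[pid] += pages
    return True

def rdma_pin(pid, pages):
    if rdma_pinned[pid] + pages > RLIMIT_MEMLOCK:
        return False
    rdma_pinned[pid] += pages
    return True

def iouring_pin(uid, pages):
    if iouring_pinned[uid] + pages > RLIMIT_MEMLOCK:
        return False
    iouring_pinned[uid] += pages
    return True

# One process of one user pins 3x the limit, each check passing:
assert vfio_pin(pid=1000, pages=64)
assert rdma_pin(pid=1000, pages=64)
assert iouring_pin(uid=500, pages=64)

# And because VFIO's counter is per process, a fork starts from zero:
assert vfio_pin(pid=1001, pages=64)  # hostile fork pins another 64
```

So even before asking what the right number is, there is no single
counter an admin-set limit could meaningfully apply to.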
>
> Consider qemu where we might have 3 different things all pinning the
> same page (rdma, iouring, vfio) - should the cgroup give 3x the limit?
> What use is that really?

I think you're tackling a related problem there: that we double-account
unmovable/mlocked memory because we have no way to track that a page is
already pinned by the same user/cgroup/whatsoever. Not easy to solve.

The problem also becomes interesting if iouring with fixed buffers
doesn't work on guest RAM, but on some other QEMU buffers.

>
> IMHO there are only two meaningful scenarios - either you are unpriv
> and limited to a very small number for your user/cgroup - or you are
> priv and you can do whatever you want.
>
> The idea we can fine tune this to exactly the right amount for a
> workload does not seem realistic and ends up exporting internal kernel
> decisions into a uAPI..

IMHO, there are three use cases:

* Apps that conditionally use selected mechanisms that end up requiring
  unmovable, long-term allocations: secretmem, iouring, rdma. We want
  some sane, small default. Such apps have a backup path in case any of
  these mechanisms fails because we're out of allowed unmovable
  resources.

* Apps that rely on one selected mechanism that ends up requiring
  unmovable, long-term allocations. E.g., vfio with known memory
  consumption, such as the VM size. It's fairly easy to come up with
  the right value.

* Apps that rely on multiple mechanisms that end up requiring unmovable,
  long-term allocations: QEMU with rdma, iouring, vfio, ... I agree
  that coming up with something good here is problematic.

Then, there are privileged/unprivileged apps. There might be admins
that just don't care. There might be admins that even want to set some
limit instead of configuring "unlimited" for QEMU.

Long story short, it should be an admin choice what to configure,
especially:

* What the default is for random apps
* What the maximum is for selected apps
* Which apps don't have a maximum

--
Thanks,

David / dhildenb