On 22.11.21 21:44, Jens Axboe wrote:
> On 11/22/21 1:08 PM, David Hildenbrand wrote:
>> On 22.11.21 20:53, Jens Axboe wrote:
>>> On 11/22/21 11:26 AM, David Hildenbrand wrote:
>>>> On 22.11.21 18:55, Andrew Dona-Couch wrote:
>>>>> Forgive me for jumping in to an already overburdened thread. But can
>>>>> someone pushing back on this clearly explain the issue with applying
>>>>> this patch?
>>>>
>>>> It will allow unprivileged users to easily and even "accidentally"
>>>> allocate more unmovable memory than it should in some environments. Such
>>>> limits exist for a reason. And there are ways for admins/distros to
>>>> tweak these limits if they know what they are doing.
>>>
>>> But that's entirely the point, the cases where this change is needed are
>>> already screwed by a distro and the user is the administrator. This is
>>> _exactly_ the case where things should just work out of the box. If
>>> you're managing farms of servers, yeah you have competent administration
>>> and you can be expected to tweak settings to get the best experience and
>>> performance, but the kernel should provide a sane default. 64K isn't a
>>> sane default.
>>
>> 0.1% of RAM isn't either.
>
> No default is perfect, but 0.1% will solve 99% of the problem. And most
> likely solve 100% of the problems for the important case, which is where
> you want things to Just Work on your distro without doing any
> administration. If you're aiming for perfection, it doesn't exist.

... and my Fedora is already at 16 MiB *sigh*.

And I'm not aiming for perfection, I'm aiming for as few FOLL_LONGTERM
users as possible ;)

>
>>>> This is not a step into the right direction. This is all just trying to
>>>> hide the fact that we're exposing FOLL_LONGTERM usage to random
>>>> unprivileged users.
>>>>
>>>> Maybe we could instead try getting rid of FOLL_LONGTERM usage and the
>>>> memlock limit in io_uring altogether, for example, by using mmu
>>>> notifiers. But I'm no expert on the io_uring code.
>>>
>>> You can't use mmu notifiers without impacting the fast path. This isn't
>>> just about io_uring, there are other users of memlock right now (like
>>> bpf) which just makes it even worse.
>>
>> 1) Do we have a performance evaluation? Did someone try and come up with
>> a conclusion how bad it would be?
>
> I honestly don't remember the details, I took a look at it about a year
> ago due to some unrelated reasons. These days it just pertains to
> registered buffers, so it's less of an issue than back then when it
> dealt with the rings as well. Hence might be feasible, I'm certainly not
> against anyone looking into it. Easy enough to review and test for
> performance concerns.

That at least sounds promising.

>
>> 2) Could we provide a mmu variant to ordinary users that's just good
>> enough but maybe not as fast as what we have today? And limit
>> FOLL_LONGTERM to special, privileged users?
>
> If it's not as fast, then it's most likely not good enough though...

There is always a compromise of course.

See, FOLL_LONGTERM is *the worst* kind of memory allocation thingy you
could possibly do to your MM subsystem. It's absolutely the worst thing you
can do to swap and compaction.

I really don't want random feature X to be next and say "well, io_uring
uses it, so I can just use it for max performance and we'll adjust the
memlock limit, who cares!".

>
>> 3) Just because there are other memlock users is not an excuse. For
>> example, VFIO/VDPA have to use it for a reason, because there is no way
>> not to use FOLL_LONGTERM.
>
> It's not an excuse, the statement merely means that the problem is
> _worse_ as there are other memlock users.

Yes, and it will keep getting worse every time we introduce more
FOLL_LONGTERM users that really shouldn't be FOLL_LONGTERM users unless
really required. Again, VFIO/VDPA/RDMA are prime examples, because the HW
forces us to do it. And these are privileged features either way.
>
>>>
>>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>> like what was suggested, if that will help move things forward. IMHO the
>>> 32MB machine is mostly a theoretical case, but whatever.
>>
>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>
>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>> system completely and deeply mess up the MM. Oh my.
>
> We're talking per-user limits here. But if you want to talk hyperbole,
> then 64K multiplied by some other random number will also allow
> everything to be pinned, potentially.
>

Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
the worst case.

-- 
Thanks,

David / dhildenb
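
To put rough numbers on the worst case described above, here is a minimal
arithmetic sketch. It assumes the quoted "min(0.1% ram, 64KB)" is meant as
a 64 KiB lower bound (that reading is an assumption, as is everything
about the helper names, which are hypothetical and not kernel code). It
models only the per-user limit itself and ignores ZONE_MOVABLE/CMA and any
other pinned memory.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def proposed_memlock_default(total_ram_bytes, floor_bytes=64 * KIB):
    """Hypothetical helper: 0.1% of RAM, but never below the 64 KiB floor.

    Assumption: the proposal's "min(0.1% ram, 64KB)" means "at least 64KB".
    """
    return max(total_ram_bytes // 1000, floor_bytes)


def worst_case_pinned_fraction(total_ram_bytes, n_users):
    """Fraction of RAM pinned if n_users each exhaust their per-user limit.

    Capped at 1.0 because no more than all of RAM can be pinned.
    """
    per_user = proposed_memlock_default(total_ram_bytes)
    return min(1.0, n_users * per_user / total_ram_bytes)


if __name__ == "__main__":
    for ram in (32 * MIB, 16 * GIB):
        limit = proposed_memlock_default(ram)
        print(f"RAM {ram // MIB} MiB -> per-user limit {limit // KIB} KiB")
        for users in (1, 100, 1000):
            frac = worst_case_pinned_fraction(ram, users)
            print(f"  {users} users: up to {frac:.1%} of RAM pinned")
```

On a small machine the floor dominates, so the per-user share of RAM is
larger than 0.1%; on larger machines the fraction scales linearly with the
number of users that max out their limit.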