>>>>
>>>> We should just make this 0.1% of RAM (min(0.1% ram, 64KB)) or something
>>>> like what was suggested, if that will help move things forward. IMHO the
>>>> 32MB machine is mostly a theoretical case, but whatever .
>>>
>>> 1) I'm deeply concerned about large ZONE_MOVABLE and MIGRATE_CMA ranges
>>> where FOLL_LONGTERM cannot be used, as that memory is not available.
>>>
>>> 2) With 0.1% RAM it's sufficient to start 1000 processes to break any
>>> system completely and deeply mess up the MM. Oh my.
>>
>> We're talking per-user limits here. But if you want to talk hyperbole,
>> then 64K multiplied by some other random number will also allow
>> everything to be pinned, potentially.
>>
>
> Right, it's per-user. 0.1% per user FOLL_LONGTERM locked into memory in
> the worst case.
>

To make it clear why I keep complaining about FOLL_LONGTERM for
unprivileged users even if we're talking about "only" 0.1% of RAM ...

On x86-64 a 2 MiB THP (IOW pageblock) has 512 sub-pages. If we manage to
FOLL_LONGTERM a single sub-page, we can make the THP unavailable to the
system, meaning we cannot form a THP by
compaction/swapping/migration/whatever at that physical memory area until
we unpin that single page. We essentially "block" a THP from forming at
that physical memory area.

So with a single 4k page we can block one 2 MiB THP. With 0.1% we can,
therefore, block 51.2% of all THP. Theoretically, of course, if the
stars align.

... or if we're malicious or unlucky. I wrote a reproducer this morning
that tries blocking as many THP as it can:

https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/io_uring_thp.c

------------------------------------------------------------------------

Example on my 16 GiB (8192 THP "in theory") notebook with some
applications running in the background.
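
The arithmetic behind these numbers can be sketched in a few lines of
Python. This is only a back-of-the-envelope model and not part of the
reproducer: the function names (pages_per_thp, blockable_thp,
blocked_fraction) are made up for illustration, and it assumes the worst
case described above, where every pinned 4 KiB page lands in a different
2 MiB area.

```python
PAGE_SIZE = 4096             # bytes, base page size on x86-64
THP_SIZE = 2 * 1024 * 1024   # bytes, 2 MiB THP (one pageblock)


def pages_per_thp(page_size=PAGE_SIZE, thp_size=THP_SIZE):
    """Number of base pages that make up one THP-sized area."""
    return thp_size // page_size


def blockable_thp(pin_limit_bytes, total_ram_bytes=None,
                  page_size=PAGE_SIZE, thp_size=THP_SIZE):
    """Worst-case number of THP-sized areas a pin limit can block.

    Assumption: each pinned base page sits in a different THP-sized
    area, so one pinned page blocks one whole area.
    """
    blocked = pin_limit_bytes // page_size
    if total_ram_bytes is not None:
        # Cannot block more areas than the machine has.
        blocked = min(blocked, total_ram_bytes // thp_size)
    return blocked


def blocked_fraction(pinned_fraction, page_size=PAGE_SIZE, thp_size=THP_SIZE):
    """Worst-case fraction of all THP blocked when pinning a fraction of RAM."""
    return min(1.0, pinned_fraction * pages_per_thp(page_size, thp_size))


if __name__ == "__main__":
    limit = 16 * 1024 * 1024          # example: a 16 MiB memlock limit
    ram = 16 * 1024 * 1024 * 1024     # example: a 16 GiB machine
    n = blockable_thp(limit, ram)
    print("pages per THP:", pages_per_thp())
    print("blockable THP:", n, "covering", n * THP_SIZE, "bytes")
    print("0.1% of RAM pinned blocks up to",
          round(blocked_fraction(0.001) * 100, 1), "% of all THP")
```

The factor of 512 is what makes a small pin limit matter: the blocked
share of THP grows 512 times faster than the pinned share of RAM, and
saturates at 100% once roughly 0.2% of RAM is pinned in this model.
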

$ uname -a
Linux t480s 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ ./io_uring_thp
PAGE size: 4096 bytes (sensed)
THP size: 2097152 bytes (sensed)
RLIMIT_MEMLOCK: 16777216 bytes (sensed)
IORING_MAX_REG_BUFFERS: 16384 (guess)
Pages per THP: 512
User can block 4096 THP (8589934592 bytes)
Process can block 4096 THP (8589934592 bytes)
Blocking 1 THP
Blocking 2 THP
...
Blocking 3438 THP
Blocking 3439 THP
Blocking 3440 THP
Blocking 3441 THP
Blocking 3442 THP
... and after a while
Blocking 4093 THP
Blocking 4094 THP
Blocking 4095 THP
Blocking 4096 THP

$ cat /proc/`pgrep io_uring_thp`/status
Name:           io_uring_thp
Umask:          0002
State:          S (sleeping)
[...]
VmPeak:         6496 kB
VmSize:         6496 kB
VmLck:          0 kB
VmPin:          16384 kB
VmHWM:          3628 kB
VmRSS:          1580 kB
RssAnon:        160 kB
RssFile:        1420 kB
RssShmem:       0 kB
VmData:         4304 kB
VmStk:          136 kB
VmExe:          8 kB
VmLib:          1488 kB
VmPTE:          48 kB
VmSwap:         0 kB
HugetlbPages:   0 kB
CoreDumping:    0
THP_enabled:    1

$ cat /proc/meminfo
MemTotal:       16250920 kB
MemFree:        11648016 kB
MemAvailable:   11972196 kB
Buffers:        50480 kB
Cached:         1156768 kB
SwapCached:     54680 kB
Active:         704788 kB
Inactive:       3477576 kB
Active(anon):   427716 kB
Inactive(anon): 3207604 kB
Active(file):   277072 kB
Inactive(file): 269972 kB
...
Mlocked:        5692 kB
SwapTotal:      8200188 kB
SwapFree:       7742716 kB
...
AnonHugePages:  0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages:  0 kB
FilePmdMapped:  0 kB

Let's see how many contiguous 2M pages we can still get as root:

$ echo 1 > /proc/sys/vm/compact_memory
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
537
...
keep retrying a couple of times
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
583

Let's kill the io_uring process and try again:

$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4766
... keep retrying a couple of times
$ echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
4823

------------------------------------------------------------------------

I'm going to leave the judgment of how bad this is or isn't to the
educated reader, and I'll stop spending time on this as I have more
important things to work on.

To summarize my humble opinion:

1) I am not against raising the default memlock limit if it's for a sane
use case. While mlock itself can be somewhat bad for swap, FOLL_LONGTERM
that also checks the memlock limit here is the real issue. This patch
explicitly states the "IOURING_REGISTER_BUFFERS" use case, though, and
that makes me nervous.

2) Exposing FOLL_LONGTERM to unprivileged users should be avoided as
best we can; in an ideal world, we wouldn't have it at all; in a
sub-optimal world we'd have it only for use cases that really require it
due to HW limitations. Ideally we'd even have yet another limit for
this, because mlock != FOLL_LONGTERM.

3) IOURING_REGISTER_BUFFERS shouldn't use FOLL_LONGTERM for use by
unprivileged users. We should provide a variant that doesn't rely on
FOLL_LONGTERM or even rely on the memlock limit.

Sorry to the patch author for bringing it up as a response to the patch.
After all, this patch just does what some distros already do (many
distros even provide higher limits than 8 MiB!). I would be curious why
some distros already have such high values ... and if it's already
because of IOURING_REGISTER_BUFFERS after all.

--
Thanks,

David / dhildenb