On Jul 08 13:47, Andrew Morton wrote: > On Wed, 6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@xxxxxxxxxx> wrote: > > > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has > > CAP_SYS_ADMIN or is requesting collapse of it's own memory. > > This is maximally restrictive. I didn't see any discussion of why this > was chosen either here of in the [0/N]. I expect that people will be > coming after us to relax this. > > So please do add (a lot of) words explaining this decision, and > describing what might be done in the future to relax it. Hey Andrew, Thanks for taking the time to look at this series. After taking a look through capabilities(7) I think you're absolutely right to call this out - thanks for that. I think move_pages(2) seems to be the best comparison here. There, we use CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS to ensure the caller is able to copying + moving memory of an eternal process, between nodes. This is also the current default for process_madvise(2). However, MADV_COLLAPSE additionally is able to: 1) Influence the RSS of a process / memory charged to a cgroup (by collapsing a hugepage-sized/aligned region with nonresident pages). Note that for file/shmem, this might cause increase in file/shmem RSS for non-target mm's. 2) Bypass sysfs THP settings For (1), process_madvise(MADV_WILLNEED) could presumably be used to increase RSS / memcg usage, and we don't require any additional capabilities there. For (2), I don't think there is an easy precedent. I think it makes sense that the caller has write permission to /sys/kernel/mm/transparent_hugapage/*. AFAICT, this means an effective user ID of 0 ... which is similarly restrictive like CAP_SYS_ADMIN. One idea would be to use CAP_SETUID, since these threads could always assume an real/effective user ID of 0. That said, I'm note sure CAP_SETUID is needed, and perhaps the existing process_madvise(2) restrictions are enough given CAP_SYS_NICE confers ability to copy around all the same memory.. we'll just be doing some additional page table manipulations after some of that copying - which should (mostly) be transparent to the users. I.e. I don't think it expands CAP_SYS_NICE's "security silo" that much. Could be wrong through. Again, thanks for your time, Zach