Re: [mm-unstable v7 12/18] mm/madvise: add MADV_COLLAPSE to process_madvise()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Jul 08 13:47, Andrew Morton wrote:
> On Wed,  6 Jul 2022 16:59:30 -0700 "Zach O'Keefe" <zokeefe@xxxxxxxxxx> wrote:
> 
> > Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has
> > CAP_SYS_ADMIN or is requesting collapse of it's own memory.
> 
> This is maximally restrictive.  I didn't see any discussion of why this
> was chosen either here of in the [0/N].  I expect that people will be
> coming after us to relax this.
> 
> So please do add (a lot of) words explaining this decision, and
> describing what might be done in the future to relax it.

Hey Andrew,

Thanks for taking the time to look at this series. After taking a look through
capabilities(7) I think you're absolutely right to call this out - thanks for
that.

I think move_pages(2) seems to be the best comparison here. There, we use
CAP_SYS_NICE + PTRACE_MODE_READ_REALCREDS to ensure the caller is able to
copying + moving memory of an eternal process, between nodes.  This is also the
current default for process_madvise(2). However, MADV_COLLAPSE additionally is
able to:

1) Influence the RSS of a process / memory charged to a cgroup (by
  collapsing a hugepage-sized/aligned region with nonresident pages). Note that
  for file/shmem, this might cause increase in file/shmem RSS for non-target
  mm's.
2) Bypass sysfs THP settings

For (1), process_madvise(MADV_WILLNEED) could presumably be used to increase RSS
/ memcg usage, and we don't require any additional capabilities there.

For (2), I don't think there is an easy precedent. I think it makes sense that
the caller has write permission to /sys/kernel/mm/transparent_hugapage/*.
AFAICT, this means an effective user ID of 0 ... which is similarly restrictive
like CAP_SYS_ADMIN. One idea would be to use CAP_SETUID, since these threads
could always assume an real/effective user ID of 0.

That said, I'm note sure CAP_SETUID is needed, and perhaps the existing
process_madvise(2) restrictions are enough given CAP_SYS_NICE confers ability to
copy around all the same memory.. we'll just be doing some additional page table
manipulations after some of that copying - which should (mostly) be transparent
to the users. I.e. I don't think it expands CAP_SYS_NICE's "security silo" that
much. Could be wrong through.

Again, thanks for your time,
Zach





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux