On Thu, Oct 17, 2024 at 1:49 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote: > > On Thu, Oct 17, 2024 at 01:34:53PM -0700, Jeff Xu wrote: > > Hi Pedro > > > > On Thu, Oct 17, 2024 at 12:37 PM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote: > > > > > > > For PROT_NONE mappings, the previous blocking of > > > > madvise(MADV_DONTNEED) is unnecessary. As PROT_NONE already prohibits > > > > memory access, madvise(MADV_DONTNEED) should be allowed to proceed in > > > > order to free the page. > > > > > > I don't get it. Is there an actual use case for this? > > > > > Sealing should not over-blocking API that it can allow to pass without > > security concern, this is a case in that principle. > > Well, making the interface simple is also important. OpenBSD's mimmutable() > doesn't do any of this and it Just Works(tm)... > > > > > There is a user case for this as well: to seal NX stack on android, > > Android uses PROT_NONE/madvise to set up a guide page to prevent stack > > run over boundary. So we need to let madvise to pass. > > And you need to MADV_DONTNEED this guard page? > Yes. > > > > > > For file-backed, private, read-only memory mappings, we previously did > > > > not block the madvise(MADV_DONTNEED). This was based on > > > > the assumption that the memory's content, being file-backed, could be > > > > retrieved from the file if accessed again. However, this assumption > > > > failed to consider scenarios where a mapping is initially created as > > > > read-write, modified, and subsequently changed to read-only. The newly > > > > introduced VM_WASWRITE flag addresses this oversight. > > > > > > We *do not* need this. It's sufficient to just block discard operations on read-only > > > private mappings. > > I think you meant blocking madvise(MADV_DONTNEED) on all read-only > > private file-backed mappings. > > > > I considered that option, but there is a use case for madvise on those > > mappings that never get modified. > > > > Apps can use that to free up RAM. e.g. Considering read-only .text > > section, which never gets modified, madvise( MADV_DONTNEED) can free > > up RAM when memory is in-stress, memory will be reclaimed from a > > backed-file on next read access. Therefore we can't just block all > > read-only private file-backed mapping, only those that really need to, > > such as mapping changed from rw=>r (what you described) > > Does anyone actually do this? If so, why? WHYYYY? > This is a legit use case, I can't argue that it isn't. > The kernel's page reclaim logic should be perfectly cromulent. Please don't do this. > MADV_DONTNEED will also not free any pages if those are shared (rather they'll just be unmapped). > > If we really need to do this, I'd maybe suggest walking through page tables, looking for > anon ptes or swap ptes (maybe inside the actual zap code?). But I would really prefer if we > didn't need to do this. > I also considered this route, but it is too complicated. The copy-on-write pages can be put into a swap file, also there is a huge page to consider, etc, The complication makes it really difficult to code it right, also scanning those pages on per VMA level will require lock and also impact performance. > -- > Pedro