On 21.10.24 19:15, Lorenzo Stoakes wrote:
On Mon, Oct 21, 2024 at 07:05:27PM +0200, David Hildenbrand wrote:
On 20.10.24 18:20, Lorenzo Stoakes wrote:
Implement a new lightweight guard page feature, that is regions of userland
virtual memory that, when accessed, cause a fatal signal to arise.
Currently users must establish PROT_NONE ranges to achieve this.
However this is very costly memory-wise - we need a VMA for each and every
one of these regions AND they become unmergeable with surrounding VMAs.
In addition repeated mmap() calls require repeated kernel context switches
and contention of the mmap lock to install these ranges, potentially also
having to unmap memory if installed over existing ranges.
The lightweight guard approach eliminates the VMA cost altogether - rather
than establishing a PROT_NONE VMA, it operates at the level of page table
entries - poisoning PTEs such that accesses to them cause a fault followed
by a SIGSGEV signal being raised.
This is achieved through the PTE marker mechanism, which a previous commit
in this series extended to permit this to be done, installed via the
generic page walking logic, also extended by a prior commit for this
purpose.
These poison ranges are established with MADV_GUARD_POISON, and if the
range in which they are installed contain any existing mappings, they will
be zapped, i.e. free the range and unmap memory (thus mimicking the
behaviour of MADV_DONTNEED in this respect).
Any existing poison entries will be left untouched. There is no nesting of
poisoned pages.
Poisoned ranges are NOT cleared by MADV_DONTNEED, as this would be rather
unexpected behaviour, but are cleared on process teardown or unmapping of
memory ranges.
Ranges can have the poison property removed by MADV_GUARD_UNPOISON -
'remedying' the poisoning. The ranges over which this is applied, should
they contain non-poison entries, will be untouched, only poison entries
will be cleared.
We permit this operation on anonymous memory only, and only VMAs which are
non-special, non-huge and not mlock()'d (if we permitted this we'd have to
drop locked pages which would be rather counterintuitive).
Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
Suggested-by: Jann Horn <jannh@xxxxxxxxxx>
Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>
---
arch/alpha/include/uapi/asm/mman.h | 3 +
arch/mips/include/uapi/asm/mman.h | 3 +
arch/parisc/include/uapi/asm/mman.h | 3 +
arch/xtensa/include/uapi/asm/mman.h | 3 +
include/uapi/asm-generic/mman-common.h | 3 +
mm/madvise.c | 168 +++++++++++++++++++++++++
mm/mprotect.c | 3 +-
mm/mseal.c | 1 +
8 files changed, 186 insertions(+), 1 deletion(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 763929e814e9..71e13f27742d 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -78,6 +78,9 @@
#define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
+#define MADV_GUARD_POISON 102 /* fatal signal on access to range */
+#define MADV_GUARD_UNPOISON 103 /* revoke guard poisoning */
Just to raise it here: MADV_GUARD_INSTALL / MADV_GUARD_REMOVE or sth. like
that would have been even clearer, at least to me.
:)
It still feels like poisoning to me because we're explicitly putting
something in the page tables to make a range have different fault behaviour
like a HW poisoning, and 'installing' suggests backing or something like
this, I think that's more confusing.
I connect "poison" to "SIGBUS" and "corrupt memory state", not to "there
is nothing and there must not be anything". Thus my thinking. But again,
not the end of the world, just wanted to raise it ...
But no strong opinion, just if somebody else reading along was wondering
about the same.
I'm hoping to find more time to have a closer look at this this week, but in
general, the concept sounds reasonable to me.
Thanks!
Thank you for implementing this and up-streaming it!
--
Cheers,
David / dhildenb