-- Summary --

Preliminary version of the memory protection patchset, including a sample use case: turning the IMA measurement list into write-rare memory.

The core idea is to introduce two new types of memory protection, besides const and __ro_after_init, which will support:
- statically allocated "write rare" memory
- dynamically allocated "read only" and "write rare" memory

On top of that follows a set of patches which create a "write rare" counterpart of the kernel infrastructure used in the example chosen for hardening: the IMA measurement list.

-- Mechanism --

Statically allocated protected memory is identified by the __wr_after_init tag, which causes the linker to place it in a special section.

Dynamically allocated memory is obtained through vmalloc, but packing each allocation, where possible, into the most recently obtained vmap_area.

The write rare mechanism is implemented by creating a temporary alternate writable mapping, applying the change through this mapping and then removing it.

All of this is possible thanks to the system MMU, which must be able to provide write protection.

-- Brief history --

I sent out various versions of memory protection over the last year or so. However, this patchset is significantly expanded, including several helper data structures and a use case, so I decided to reset the numbering to v1.

As reference, the latest "old" version is here [1].

The current version is not yet ready for merge, but I think it is sufficiently complete to support an end-to-end discussion.

Eventually, I plan to write a white paper, once the code is in better shape. In the meantime, an overview can be had from these slides [2], which are the support material for my presentation at the Linux Security Summit 2018 Europe.

-- Validation --

Most of the testing is done on a Fedora image, with QEMU x86_64; however, the code has also been tested on a real x86_64 PC, yielding similar positive results.
For ARM64, I use a custom Debian installation, still with QEMU, but I have obtained similar failures when testing with a real device, using a Kirin970.

I have written some test cases for the most basic parts, and the behaviour of IMA and of the Fedora image in general does not seem to be negatively affected when used in conjunction with this patchset. However, this is far from exhaustive testing, and the torture test for rcu is completely missing.

-- Known Issues --

As said, this version is preliminary and certain parts need rework. This is a short and incomplete list of known issues:

* arm64 support is broken for __wr_after_init
  I must create a separate section with proper mappings, similar to the ones used for vmalloc()
* alignment of data structures has not been thoroughly checked
  There are probably several redundant forced alignments
* there is no fallback for platforms missing MMU write protection
* some additional care might be needed when dealing with double mapping vs data cache coherency, on multicore systems
* lots of additional (stress) tests are needed
* memory reuse (object caches) is probably needed, to support converting more use cases, and so are other data structures
* credits for original code: I have reimplemented various data structures, and I am not sure I have given credit correctly to the original authors
* documentation for the re-implemented data structures is missing
* confirm that the hardened usercopy logic is correct

-- Q&As --

During reviews of the older patchset, several objections and questions were formulated. They are collected here in Q&A format, with both some old and new answers:

1 - "The protection can still be undone"

Yes, it is true. Using a hypervisor, as is done in certain Huawei and Samsung phones, provides a better level of protection. However, even without that, this patchset still gives a significantly better level of protection than not protecting the memory at all.
The main advantage of this patchset is that an attack now has to focus on the page tables, which are a significantly smaller area than the whole kernel data.

It is my intention, eventually, to also provide support for interaction with a FOSS hypervisor (e.g. KVM), but this patchset should also support those cases where it is not even possible to have a hypervisor. So it seems simpler to start from there. The hypervisor is not mandatory.

2 - "Do not provide a chain of trust, but protect some memory and refer to it with a writable pointer."

This might be ok for protecting against bugs, but in the case of an attacker trying to compromise the system, the unprotected pointer becomes the new target. It doesn't change much.

Samsung does use a similar implementation for protecting LSM hooks; however, that solution also adds a pointer from the protected memory back to the writable memory, as a validation loop. The price to pay is that every time the unprotected pointer is used, it must first be validated: it has to point into a certain memory range and have a specific alignment.

It is an alternative to the full chain of trust, and each solution has its specific advantages, depending on the data structures that one wants to protect.

3 - "Do not use a secondary mapping, unprotect the current one"

The purpose of the secondary mapping is to create a hard-to-spot window of writability at a random address, which cannot be easily exploited.

Unprotecting the primary mapping would allow an attack where a core busy-loops, waiting for a specific location to become writable, and then races against the legitimate writer.

For the same reason, interrupts are disabled on the core that is performing the write-rare operation.

4 - "Do not create another allocator over vmalloc(), use it plain"

This is not good for various reasons:

a) vmalloc() allocates at least one page for every request it receives, leaving most of the page typically unused.
While it might not be a big deal on large systems, on IoT class devices it is possible to find relatively powerful cores paired with relatively little memory.

Taking as an example a system using SELinux, a relatively small set of rules can generate a few thousand allocations (SELinux is deny-by-default).

Modeling each allocation as about 64 bytes, on a system with 4kB pages, and assuming a grand total of 100k allocations, one page per allocation means:

    100k * 4kB = ~390MB

while packing the 64-byte slots into pages yields:

    100k * 64B = ~6MB

The first case would not be very compatible with a system having only 512MB or 1GB.

b) even worse, the amount of TLB thrashing would be terrible, with each allocation having its own translation.

--
Signed-off-by: Igor Stoppa <igor.stoppa@xxxxxxxxxx>

-- References --

[1]: https://lkml.org/lkml/2018/4/23/508
[2]: https://events.linuxfoundation.org/wp-content/uploads/2017/12/Kernel-Hardening-Protecting-the-Protection-Mechanisms-Igor-Stoppa-Huawei.pdf

-- List of patches --

[PATCH 01/17] prmem: linker section for static write rare
[PATCH 02/17] prmem: write rare for static allocation
[PATCH 03/17] prmem: vmalloc support for dynamic allocation
[PATCH 04/17] prmem: dynamic allocation
[PATCH 05/17] prmem: shorthands for write rare on common types
[PATCH 06/17] prmem: test cases for memory protection
[PATCH 07/17] prmem: lkdtm tests for memory protection
[PATCH 08/17] prmem: struct page: track vmap_area
[PATCH 09/17] prmem: hardened usercopy
[PATCH 10/17] prmem: documentation
[PATCH 11/17] prmem: llist: use designated initializer
[PATCH 12/17] prmem: linked list: set alignment
[PATCH 13/17] prmem: linked list: disable layout randomization
[PATCH 14/17] prmem: llist, hlist, both plain and rcu
[PATCH 15/17] prmem: test cases for prlist and prhlist
[PATCH 16/17] prmem: pratomic-long
[PATCH 17/17] prmem: ima: turn the measurements list write rare

-- Diffstat --

 Documentation/core-api/index.rst          |   1 +
 Documentation/core-api/prmem.rst          | 172 +++++
 MAINTAINERS                               |  14 +
 drivers/misc/lkdtm/core.c                 |  13 +
 drivers/misc/lkdtm/lkdtm.h                |  13 +
 drivers/misc/lkdtm/perms.c                | 248 +++++++
 include/asm-generic/vmlinux.lds.h         |  20 +
 include/linux/cache.h                     |  17 +
 include/linux/list.h                      |   5 +-
 include/linux/mm_types.h                  |  25 +-
 include/linux/pratomic-long.h             |  73 ++
 include/linux/prlist.h                    | 934 ++++++++++++++++++++++++
 include/linux/prmem.h                     | 446 +++++++++++
 include/linux/prmemextra.h                | 133 ++++
 include/linux/types.h                     |  20 +-
 include/linux/vmalloc.h                   |  11 +-
 lib/Kconfig.debug                         |   9 +
 lib/Makefile                              |   1 +
 lib/test_prlist.c                         | 252 +++++++
 mm/Kconfig                                |   6 +
 mm/Kconfig.debug                          |   9 +
 mm/Makefile                               |   2 +
 mm/prmem.c                                | 273 +++++++
 mm/test_pmalloc.c                         | 629 ++++++++++++++++
 mm/test_write_rare.c                      | 236 ++++++
 mm/usercopy.c                             |   5 +
 mm/vmalloc.c                              |   7 +
 security/integrity/ima/ima.h              |  18 +-
 security/integrity/ima/ima_api.c          |  29 +-
 security/integrity/ima/ima_fs.c           |  12 +-
 security/integrity/ima/ima_main.c         |   6 +
 security/integrity/ima/ima_queue.c        |  28 +-
 security/integrity/ima/ima_template.c     |  14 +-
 security/integrity/ima/ima_template_lib.c |  14 +-
 34 files changed, 3635 insertions(+), 60 deletions(-)
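
For readers who want to experiment with the idea from the Mechanism section (and Q&A 3) outside the kernel, the alternate-mapping technique can be loosely mimicked in user space. The following is only a sketch under stated assumptions, not code from this patchset: struct wr_pool, wr_pool_init(), wr_memcpy() and wr_demo() are hypothetical names, and a temporary file mapped twice with mmap() stands in for the kernel's double mapping. The primary mapping stays read-only for its whole life, while each update goes through a short-lived writable alias.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical user-space stand-in for a protected pool (not the prmem API). */
struct wr_pool {
	int fd;          /* backing object shared by both mappings */
	size_t size;     /* pool length: one page in this sketch */
	const char *ro;  /* primary mapping, read-only for its whole life */
};

/* Create the pool and its permanent read-only mapping. */
static int wr_pool_init(struct wr_pool *p)
{
	char path[] = "/tmp/wr_pool_XXXXXX";
	void *ro;

	p->size = (size_t)sysconf(_SC_PAGESIZE);
	p->fd = mkstemp(path);
	if (p->fd < 0)
		return -1;
	unlink(path);	/* the open fd keeps the backing object alive */
	if (ftruncate(p->fd, (off_t)p->size) < 0) {
		close(p->fd);
		return -1;
	}
	ro = mmap(NULL, p->size, PROT_READ, MAP_SHARED, p->fd, 0);
	if (ro == MAP_FAILED) {
		close(p->fd);
		return -1;
	}
	p->ro = ro;
	return 0;
}

/* Write-rare update: map a temporary writable alias, copy, unmap it. */
static int wr_memcpy(struct wr_pool *p, size_t off, const void *src, size_t len)
{
	char *rw;

	if (off > p->size || len > p->size - off)
		return -1;
	rw = mmap(NULL, p->size, PROT_READ | PROT_WRITE, MAP_SHARED, p->fd, 0);
	if (rw == MAP_FAILED)
		return -1;
	memcpy(rw + off, src, len);
	munmap(rw, p->size);	/* close the window of writability */
	return 0;
}

/* Returns 0 if the update is visible through the read-only mapping. */
int wr_demo(void)
{
	struct wr_pool pool;
	int ret;

	if (wr_pool_init(&pool) < 0)
		return -1;
	assert(pool.ro[0] == '\0');	/* ftruncate() zero-fills the new page */
	ret = wr_memcpy(&pool, 0, "measurement", sizeof("measurement"));
	if (ret == 0)
		ret = strcmp(pool.ro, "measurement") != 0;
	munmap((void *)pool.ro, pool.size);
	close(pool.fd);
	return ret;
}
```

Calling wr_demo() is expected to return 0 on a Linux system with a writable /tmp. The analogy is deliberately loose: the sketch does not pick a random address for the alias itself (it leaves placement to mmap()), it does not disable interrupts, and a user-space attacker could simply call mmap() again, whereas in the kernel the protection is enforced through the page tables.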