On Fri, May 14, 2021 at 11:25:43AM +0200, David Hildenbrand wrote:
> > #ifdef CONFIG_IA64
> > # include <linux/efi.h>
> > @@ -64,6 +65,9 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
> > #ifdef CONFIG_STRICT_DEVMEM
> > static inline int page_is_allowed(unsigned long pfn)
> > {
> > +	if (pfn_valid(pfn) && page_is_secretmem(pfn_to_page(pfn)))
> > +		return 0;
> > +
>
> 1. The memmap might be garbage. You should use pfn_to_online_page() instead.
>
> page = pfn_to_online_page(pfn);
> if (page && page_is_secretmem(page))
> 	return 0;
>
> 2. What about !CONFIG_STRICT_DEVMEM?
>
> 3. Someone could map physical memory before a secretmem page gets allocated
> and read the content after it got allocated and gets used. If someone would
> gain root privileges and would wait for the target application to (re)start,
> that could be problematic.
>
>
> I do wonder if enforcing CONFIG_STRICT_DEVMEM would be cleaner.
> devmem_is_allowed() should disallow access to any system ram, and thereby,
> any possible secretmem pages, avoiding this check completely.

I've been thinking a bit more about the /dev/mem case, and it seems I was
too fast on the trigger with adding that test for page_is_secretmem().

When CONFIG_STRICT_DEVMEM=y, access to RAM is forbidden anyway, and if the
user built a kernel with CONFIG_STRICT_DEVMEM=n, all the physical memory is
accessible by root anyway.

We might want to default STRICT_DEVMEM to "y" for all architectures and not
only arm64, ppc and x86, but this is not strictly related to this series.

> [...]
>
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > new file mode 100644
> > index 000000000000..1ae50089adf1
> > --- /dev/null
> > +++ b/mm/secretmem.c
> > @@ -0,0 +1,239 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright IBM Corporation, 2021
> > + *
> > + * Author: Mike Rapoport <rppt@xxxxxxxxxxxxx>
> > + */
> > +
> > +#include <linux/mm.h>
> > +#include <linux/fs.h>
> > +#include <linux/swap.h>
> > +#include <linux/mount.h>
> > +#include <linux/memfd.h>
> > +#include <linux/bitops.h>
> > +#include <linux/printk.h>
> > +#include <linux/pagemap.h>
> > +#include <linux/syscalls.h>
> > +#include <linux/pseudo_fs.h>
> > +#include <linux/secretmem.h>
> > +#include <linux/set_memory.h>
> > +#include <linux/sched/signal.h>
> > +
> > +#include <uapi/linux/magic.h>
> > +
> > +#include <asm/tlbflush.h>
> > +
> > +#include "internal.h"
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) "secretmem: " fmt
> > +
> > +/*
> > + * Define mode and flag masks to allow validation of the system call
> > + * parameters.
> > + */
> > +#define SECRETMEM_MODE_MASK (0x0)
> > +#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
> > +
> > +static bool secretmem_enable __ro_after_init;
> > +module_param_named(enable, secretmem_enable, bool, 0400);
> > +MODULE_PARM_DESC(secretmem_enable,
> > +		 "Enable secretmem and memfd_secret(2) system call");
> > +
> > +static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > +{
> > +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > +	struct inode *inode = file_inode(vmf->vma->vm_file);
> > +	pgoff_t offset = vmf->pgoff;
> > +	gfp_t gfp = vmf->gfp_mask;
> > +	unsigned long addr;
> > +	struct page *page;
> > +	int err;
> > +
> > +	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > +		return vmf_error(-EINVAL);
> > +
> > +retry:
> > +	page = find_lock_page(mapping, offset);
> > +	if (!page) {
> > +		page = alloc_page(gfp | __GFP_ZERO);
>
> We'll end up here with gfp == GFP_HIGHUSER (via the mapping below), correct?

Yes

> > +		if (!page)
> > +			return VM_FAULT_OOM;
> > +
> > +		err = set_direct_map_invalid_noflush(page, 1);
> > +		if (err) {
> > +			put_page(page);
> > +			return vmf_error(err);
>
> Would we want to translate that to a proper VM_FAULT_..., which would most
> probably be VM_FAULT_OOM when we fail to allocate a pagetable?

That's what vmf_error does, it translates -ESOMETHING to VM_FAULT_XYZ.
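
To illustrate the translation, here is a rough userspace model of it. This
is a sketch written from memory, not a copy of the kernel helper: the
model_* names and the numeric values are made up for the example, and the
only behaviour assumed about the real helper is that -ENOMEM is turned into
an OOM fault while other errors become a SIGBUS fault.

```c
#include <errno.h>

/*
 * Illustrative stand-ins for the kernel's vm_fault_t bits. The names and
 * values are assumptions for this sketch, not the kernel definitions.
 */
typedef unsigned int model_vm_fault_t;
#define MODEL_VM_FAULT_OOM	0x1u
#define MODEL_VM_FAULT_SIGBUS	0x2u

/*
 * Userspace model of the errno -> fault code translation:
 * -ENOMEM becomes an OOM fault, any other error a SIGBUS fault.
 */
static inline model_vm_fault_t model_vmf_error(int err)
{
	if (err == -ENOMEM)
		return MODEL_VM_FAULT_OOM;
	return MODEL_VM_FAULT_SIGBUS;
}
```

Under this model, if set_direct_map_invalid_noflush() fails with -ENOMEM
because a pagetable could not be allocated, the fault handler would report
an OOM fault, whereas the -EINVAL from the size check would come out as a
SIGBUS fault.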
> > +		}
> > +
> > +		__SetPageUptodate(page);
> > +		err = add_to_page_cache_lru(page, mapping, offset, gfp);
> > +		if (unlikely(err)) {
> > +			put_page(page);
> > +			/*
> > +			 * If a split of large page was required, it
> > +			 * already happened when we marked the page invalid
> > +			 * which guarantees that this call won't fail
> > +			 */
> > +			set_direct_map_default_noflush(page, 1);
> > +			if (err == -EEXIST)
> > +				goto retry;
> > +
> > +			return vmf_error(err);
> > +		}
> > +
> > +		addr = (unsigned long)page_address(page);
> > +		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>
> Hmm, to me it feels like something like that belongs into the
> set_direct_map_invalid_*() calls? Otherwise it's just very easy to mess up
> ...

AFAIU the set_direct_map_*() helpers deliberately do not flush the TLB and
leave it to the caller, so that multiple updates of the direct map can be
gathered and followed by a single TLB flush.

> I'm certainly not a filesystem guy. Nothing else jumped at me.
>
>
> To me, the overall approach makes sense and I consider it an improved
> mlock() mechanism for storing secrets, although I'd love to have some more
> information in the log regarding access via root, namely that there are
> still fancy ways to read secretmem memory once root via
>
> 1. warm reboot attacks especially in VMs (e.g., modifying the cmdline)
> 2. kexec-style reboot attacks (e.g., modifying the cmdline)
> 3. kdump attacks
> 4. kdb most probably
> 5. "letting the process read the memory for us" via Kees if that still
>    applies
> 6. ... most probably something else
>
> Just to make people aware that there are still some things to be sorted out
> when we fully want to protect against privilege escalations.
>
> (maybe this information is buried in the cover letter already, where it
> usually gets lost)

I believe that it belongs in the man page rather than in the changelog, so
that the *users* are aware of secretmem limitations.

-- 
Sincerely yours,
Mike.