On Tue, Sep 01, 2020 at 09:44:59PM +0530, Sumit Semwal wrote: > From: Colin Cross <ccross@xxxxxxxxxx> > > In many userspace applications, and especially in VM based applications > like Android uses heavily, there are multiple different allocators in use. > At a minimum there is libc malloc and the stack, and in many cases there > are libc malloc, the stack, direct syscalls to mmap anonymous memory, and > multiple VM heaps (one for small objects, one for big objects, etc.). > Each of these layers usually has its own tools to inspect its usage; > malloc by compiling a debug version, the VM through heap inspection tools, > and for direct syscalls there is usually no way to track them. > > On Android we heavily use a set of tools that use an extended version of > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped > in userspace and slice their usage by process, shared (COW) vs. unique > mappings, backing, etc. This can account for real physical memory usage > even in cases like fork without exec (which Android uses heavily to share > as many private COW pages as possible between processes), Kernel SamePage > Merging, and clean zero pages. It produces a measurement of the pages > that only exist in that process (USS, for unique), and a measurement of > the physical memory usage of that process with the cost of shared pages > being evenly split between processes that share them (PSS). > > If all anonymous memory is indistinguishable then figuring out the real > physical memory usage (PSS) of each heap requires either a pagemap walking > tool that can understand the heap debugging of every layer, or for every > layer's heap debugging tools to implement the pagemap walking logic, in > which case it is hard to get a consistent view of memory across the whole > system. > > Tracking the information in userspace leads to all sorts of problems. > It either needs to be stored inside the process, which means every > process has to have an API to export its current heap information upon > request, or it has to be stored externally in a filesystem that > somebody needs to clean up on crashes. It needs to be readable while > the process is still running, so it has to have some sort of > synchronization with every layer of userspace. Efficiently tracking > the ranges requires reimplementing something like the kernel vma > trees, and linking to it from every layer of userspace. It requires > more memory, more syscalls, more runtime cost, and more complexity to > separately track regions that the kernel is already tracking. > > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a > userspace-provided name for anonymous vmas. The names of named anonymous > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>]. > > Userspace can set the name for a region of memory by calling > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); > Setting the name to NULL clears it. > > The name is stored in a user pointer in the shared union in vm_area_struct > that points to a null terminated string inside the user process. vmas > that point to the same address and are otherwise mergeable will be merged, > but vmas that point to equivalent strings at different addresses will not > be merged. > > The idea to store a userspace pointer to reduce the complexity within mm > (at the expense of the complexity of reading /proc/pid/mem) came from Dave > Hansen. This results in no runtime overhead in the mm subsystem other > than comparing the anon_name pointers when considering vma merging. The > pointer is stored in a union with fields that are only used on file-backed > mappings, so it does not increase memory usage. > (Upstream changed to remove the union, so this patch adds it back as well) This is, for me, the weirdest part of the series. Very little in the kernel does stuff like this, and given the Fun(tm) we've had with things like userfaultfd, I cannot begin to imagine what it'd be nice to have this kind of control from a process over what is reading its maps file. (i.e. stalling a maps reader by installing a userfaultfd for the anon_name: that'd be a DoS for anything trying to read the maps file, etc.) Why is a kernel-copied string insufficient for this? I don't think VMA merging is a fast-path operation, so doing a strcmp isn't going to wreck anything... Let me try to find the earlier thread with Dave Hansen... okay, found it: https://lore.kernel.org/linux-mm/51DDF071.5000309@xxxxxxxxx/ Right, so, this idea predates userfaultfd. :) More notes below, but I *really* think this should not be a userspace pointer. And since a separate union has been found, let's just do a strndup_user() for the name, validate it as containing only printable characters without \n \r \v \f and move the merging logic into a separate patch. > > Signed-off-by: Colin Cross <ccross@xxxxxxxxxx> > Cc: Pekka Enberg <penberg@xxxxxxxxxx> > Cc: Dave Hansen <dave.hansen@xxxxxxxxx> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > Cc: Ingo Molnar <mingo@xxxxxxxxxx> > Cc: Oleg Nesterov <oleg@xxxxxxxxxx> > Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > Cc: Jan Glauber <jan.glauber@xxxxxxxxx> > Cc: John Stultz <john.stultz@xxxxxxxxxx> > Cc: Rob Landley <rob@xxxxxxxxxxx> > Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx> > Cc: Kees Cook <keescook@xxxxxxxxxxxx> > Cc: "Serge E. Hallyn" <serge.hallyn@xxxxxxxxxx> > Cc: David Rientjes <rientjes@xxxxxxxxxx> > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > Cc: Hugh Dickins <hughd@xxxxxxxxxx> > Cc: Rik van Riel <riel@xxxxxxxxxx> > Cc: Mel Gorman <mgorman@xxxxxxx> > Cc: Michel Lespinasse <walken@xxxxxxxxxx> > Cc: Tang Chen <tangchen@xxxxxxxxxxxxxx> > Cc: Robin Holt <holt@xxxxxxx> > Cc: Shaohua Li <shli@xxxxxxxxxxxx> > Cc: Sasha Levin <sasha.levin@xxxxxxxxxx> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx> > Cc: Minchan Kim <minchan@xxxxxxxxxx> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> > Signed-off-by: Sumit Semwal <sumit.semwal@xxxxxxxxxx> > > --- > > v2: updates the commit message to explain in more detail why the > patch is useful. > v3: renames vma_get_anon_name to vma_anon_name > replaces logic in seq_print_vma_name with access_process_vm > removes Name: entry from smaps, it's already on the header line > changes the prctl option number to match what is currently in > use on Android > v4: adds paragraph to commit log on why this is better than tracking > in userspace > squashes fixes from Andrew Morton to fix build error and warning > fix build error reported by Mark Salter when !CONFIG_MMU > v5: rebased to v5.9-rc1, added minor fixes to match upstream > v6: rebased to v5.9-rc3, and addressed review comments: > - added missing callers in fs/userfaultd.c > - simplified the union > - use the new access_remote_vm_locked() in show_map_vma() > since that already holds mmap_lock > v7: fixed randconfig build failure when CONFIG_ADVISE_SYSCALLS isn't > defined (In the future, can you link to lore URLs for the earlier versions? It makes it easier to do research on past threads...) > --- > Documentation/filesystems/proc.rst | 2 ++ > fs/proc/task_mmu.c | 24 ++++++++++++- > fs/userfaultfd.c | 7 ++-- > include/linux/mm.h | 12 ++++++- > include/linux/mm_types.h | 25 +++++++++++--- > include/uapi/linux/prctl.h | 3 ++ > kernel/sys.c | 32 ++++++++++++++++++ > mm/interval_tree.c | 2 +- > mm/madvise.c | 54 +++++++++++++++++++++++++++--- > mm/mempolicy.c | 3 +- > mm/mlock.c | 2 +- > mm/mmap.c | 38 ++++++++++++--------- > mm/mprotect.c | 2 +- > 13 files changed, 173 insertions(+), 33 deletions(-) > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > index 533c79e8d2cd..41a9cea73b8b 100644 > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -429,6 +429,8 @@ is not associated with a file: > [stack] the stack of the main process > [vdso] the "virtual dynamic shared object", > the kernel system call handler > +[anon:<name>] an anonymous mapping that has been typo: needs a single leading space here. > + named by userspace > ======= ==================================== > > or if empty, the mapping is anonymous. > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 5066b0251ed8..290ce5ecad0d 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -97,6 +97,21 @@ unsigned long task_statm(struct mm_struct *mm, > return mm->total_vm; > } > > +static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma) > +{ > + struct mm_struct *mm = vma->vm_mm; > + char anon_name[NAME_MAX + 1]; > + int n; > + > + n = access_remote_vm_locked(mm, (unsigned long)vma_anon_name(vma), anon_name, > + NAME_MAX, 0); > + if (n > 0) { > + seq_puts(m, "[anon:"); > + seq_write(m, anon_name, strnlen(anon_name, n)); > + seq_putc(m, ']'); seq_printf(m, "[anon:%s]", anon_name); ? > + } > +} > [...] > +#ifdef CONFIG_ADVISE_SYSCALLS > +int madvise_set_anon_name(unsigned long start, unsigned long len_in, > + unsigned long name_addr); > +#else > +static inline int madvise_set_anon_name(unsigned long start, unsigned long len_in, > + unsigned long name_addr) { > + return 0; Why not -EINVAL? > +} > +#endif > [...] > +static int madvise_vma_anon_name(struct vm_area_struct *vma, > + struct vm_area_struct **prev, > + unsigned long start, unsigned long end, > + unsigned long name_addr) > +{ > + int error; > + > + /* Only anonymous mappings can be named */ > + if (vma->vm_file) > + return -EINVAL; But a process could fork, and then change the contents of the vma_name address in one process, showing different names for the same VMA in maps? O_o That feels like an unexpected result -- the VMA should have the same name in either process. > [...] > static inline int is_mergeable_vma(struct vm_area_struct *vma, > struct file *file, unsigned long vm_flags, > - struct vm_userfaultfd_ctx vm_userfaultfd_ctx) > + struct vm_userfaultfd_ctx vm_userfaultfd_ctx, > + const char __user *anon_name) I think at the very least, this patch should be split in half: first add the feature, and then add the ability to merge. -- Kees Cook