On Thu, Oct 28, 2021 at 3:08 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Wed, Oct 27, 2021 at 1:01 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> >
> > On Wed, Oct 27, 2021 at 11:35 AM Alexey Alexandrov <aalexand@xxxxxxxxxx> wrote:
> > >
> > > > On Oct 19, 2021, at 2:55 PM, Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > >
> > > > From: Colin Cross <ccross@xxxxxxxxxx>
> > > >
> > > > In many userspace applications, and especially in VM based applications
> > > > like Android uses heavily, there are multiple different allocators in use.
> > > > At a minimum there is libc malloc and the stack, and in many cases there
> > > > are libc malloc, the stack, direct syscalls to mmap anonymous memory, and
> > > > multiple VM heaps (one for small objects, one for big objects, etc.).
> > > > Each of these layers usually has its own tools to inspect its usage;
> > > > malloc by compiling a debug version, the VM through heap inspection tools,
> > > > and for direct syscalls there is usually no way to track them.
> > > >
> > > > On Android we heavily use a set of tools that use an extended version of
> > > > the logic covered in Documentation/vm/pagemap.txt to walk all pages mapped
> > > > in userspace and slice their usage by process, shared (COW) vs. unique
> > > > mappings, backing, etc. This can account for real physical memory usage
> > > > even in cases like fork without exec (which Android uses heavily to share
> > > > as many private COW pages as possible between processes), Kernel SamePage
> > > > Merging, and clean zero pages. It produces a measurement of the pages
> > > > that only exist in that process (USS, for unique), and a measurement of
> > > > the physical memory usage of that process with the cost of shared pages
> > > > being evenly split between processes that share them (PSS).
> > > >
> > > > If all anonymous memory is indistinguishable then figuring out the real
> > > > physical memory usage (PSS) of each heap requires either a pagemap walking
> > > > tool that can understand the heap debugging of every layer, or for every
> > > > layer's heap debugging tools to implement the pagemap walking logic, in
> > > > which case it is hard to get a consistent view of memory across the whole
> > > > system.
> > > >
> > > > Tracking the information in userspace leads to all sorts of problems.
> > > > It either needs to be stored inside the process, which means every
> > > > process has to have an API to export its current heap information upon
> > > > request, or it has to be stored externally in a filesystem that
> > > > somebody needs to clean up on crashes. It needs to be readable while
> > > > the process is still running, so it has to have some sort of
> > > > synchronization with every layer of userspace. Efficiently tracking
> > > > the ranges requires reimplementing something like the kernel vma
> > > > trees, and linking to it from every layer of userspace. It requires
> > > > more memory, more syscalls, more runtime cost, and more complexity to
> > > > separately track regions that the kernel is already tracking.
> > > >
> > > > This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
> > > > userspace-provided name for anonymous vmas. The names of named anonymous
> > > > vmas are shown in /proc/pid/maps and /proc/pid/smaps as [anon:<name>].
> > > >
> > > > Userspace can set the name for a region of memory by calling
> > > > prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
> > > > Setting the name to NULL clears it. The name length limit is 80 bytes
> > > > including NUL-terminator and is checked to contain only printable ascii
> > > > characters (including space), except '[',']','\','$' and '`'.
> > > > Ascii strings are being used to have a descriptive identifiers for
> > > > vmas, which can be understood by the users reading /proc/pid/maps or
> > > > /proc/pid/smaps.
> > > > Names can be standardized for a given system and they can include some
> > > > variable parts such as the name of the allocator or a library, tid of
> > > > the thread using it, etc.
> > > >
> > > > The name is stored in a pointer in the shared union in vm_area_struct
> > > > that points to a null terminated string. Anonymous vmas with the same
> > > > name (equivalent strings) and are otherwise mergeable will be merged.
> > > > The name pointers are not shared between vmas even if they contain the
> > > > same name. The name pointer is stored in a union with fields that are
> > > > only used on file-backed mappings, so it does not increase memory usage.
> > > >
> > > > CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
> > > > feature. It keeps the feature disabled by default to prevent any
> > > > additional memory overhead and to avoid confusing procfs parsers on
> > > > systems which are not ready to support named anonymous vmas.
> > > >
> > > > The patch is based on the original patch developed by Colin Cross, more
> > > > specifically on its latest version [1] posted upstream by Sumit Semwal.
> > > > It used a userspace pointer to store vma names. In that design, name
> > > > pointers could be shared between vmas. However during the last upstreaming
> > > > attempt, Kees Cook raised concerns [2] about this approach and suggested
> > > > to copy the name into kernel memory space, perform validity checks [3]
> > > > and store as a string referenced from vm_area_struct.
> > > > One big concern is about fork() performance which would need to strdup
> > > > anonymous vma names. Dave Hansen suggested experimenting with worst-case
> > > > scenario of forking a process with 64k vmas having longest possible names
> > > > [4].
> > > > I ran this experiment on an ARM64 Android device and recorded a
> > > > worst-case regression of almost 40% when forking such a process. This
> > > > regression is addressed in the followup patch which replaces the pointer
> > > > to a name with a refcounted structure that allows sharing the name pointer
> > > > between vmas of the same name. Instead of duplicating the string during
> > > > fork() or when splitting a vma it increments the refcount.
> > > >
> > > > [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@xxxxxxxxxx/
> > > > [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
> > > > [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
> > > > [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@xxxxxxxxx/
> > > >
> > > > Changes for prctl(2) manual page (in the options section):
> > > >
> > > > PR_SET_VMA
> > > >     Sets an attribute specified in arg2 for virtual memory areas
> > > >     starting from the address specified in arg3 and spanning the
> > > >     size specified in arg4. arg5 specifies the value of the attribute
> > > >     to be set. Note that assigning an attribute to a virtual memory
> > > >     area might prevent it from being merged with adjacent virtual
> > > >     memory areas due to the difference in that attribute's value.
> > > >
> > > >     Currently, arg2 must be one of:
> > > >
> > > >     PR_SET_VMA_ANON_NAME
> > > >         Set a name for anonymous virtual memory areas. arg5 should
> > > >         be a pointer to a null-terminated string containing the
> > > >         name. The name length including null byte cannot exceed
> > > >         80 bytes. If arg5 is NULL, the name of the appropriate
> > > >         anonymous virtual memory areas will be reset. The name
> > > >         can contain only printable ascii characters (including
> > > >         space), except '[',']','\','$' and '`'.
> > > >
> > > >     This feature is available only if the kernel is built with
> > > >     the CONFIG_ANON_VMA_NAME option enabled.
> > >
> > > For what it’s worth, it’s definitely interesting to see this going upstream.
> > > In particular, we would use it for high-level grouping of the data in
> > > production profiling when proper symbolization is not available:
> > >
> > > * JVM could associate a name with the memory regions it uses for the JIT
> > >   code so that Linux perf data are associated with a high level name like
> > >   "Java JIT" even if the proper Java JIT profiling is not enabled.
> > > * Similar for other JIT engines like v8 - they could annotate the memory
> > >   regions they manage and use as well.
> > > * Traditional memory allocators like tcmalloc can use this as well so
> > >   that the associated name is used in data access profiling via Linux perf.
> >
> > Hi Alexey,
> > Thanks for providing your feedback! Nice to hear that this can be
> > useful outside of Android.
>
> Folks, it has been almost two weeks since I posted this v11 patchset.
> Is there anything else I can do to advance it towards merging?

Hi Andrew,
I haven't seen any feedback on my patchset for some time now. I think
I addressed all the questions and comments (please correct me if I
missed anything). Can it be accepted as is or is there something I
should address further?
From the feedback, I see that there are several interested parties in
this patchset (albeit all from different teams at Google) but maybe
when it's merged more users will start using it. I believe I've done
everything I could to ensure no/minimal impact on the users who don't
use this feature. Please advise.
Thanks,
Suren.
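
For readers skimming the thread, here is a rough userspace sketch of the interface described in the quoted patch description. It is not taken from the patchset. The helper names (anon_name_is_valid, name_anon_region, parse_anon_name) are hypothetical, and the validity check only mirrors the rules as worded above (80 bytes including the NUL, printable ASCII, none of '[', ']', '\', '$', '`'); the kernel's own check remains authoritative. The two constants are spelled out as in the kernel's uapi prctl header in case the installed libc headers predate the feature.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>

/* Values as defined in the kernel's include/uapi/linux/prctl.h. */
#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#endif
#ifndef PR_SET_VMA_ANON_NAME
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Limit from the patch description: 80 bytes including the NUL terminator. */
#define ANON_NAME_MAX 80

/* Hypothetical pre-check mirroring the documented rules.
 * Returns 1 if the name looks acceptable, 0 otherwise. */
static int anon_name_is_valid(const char *name)
{
    size_t len = strlen(name);

    if (len + 1 > ANON_NAME_MAX)
        return 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];

        if (c < 0x20 || c > 0x7e)      /* not printable ASCII */
            return 0;
        if (strchr("[]\\$`", c))       /* explicitly excluded characters */
            return 0;
    }
    return 1;
}

/* Hypothetical wrapper: name (or, with name == NULL, un-name) a region.
 * Returns 0 on success, -1 on failure with errno set. On kernels built
 * without CONFIG_ANON_VMA_NAME the prctl call itself is expected to fail. */
static int name_anon_region(void *start, size_t len, const char *name)
{
    if (name != NULL && !anon_name_is_valid(name)) {
        errno = EINVAL;
        return -1;
    }
    return prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                 (unsigned long)start, (unsigned long)len,
                 (unsigned long)name);
}

/* Extract <name> from a /proc/pid/maps line containing "[anon:<name>]".
 * Returns 1 and fills out on success, 0 if the line carries no such tag
 * or out is too small. Relies on ']' being disallowed inside names. */
static int parse_anon_name(const char *line, char *out, size_t outlen)
{
    const char *p = strstr(line, "[anon:");
    const char *end;

    if (p == NULL)
        return 0;
    p += strlen("[anon:");
    end = strchr(p, ']');
    if (end == NULL || (size_t)(end - p) >= outlen)
        return 0;
    memcpy(out, p, (size_t)(end - p));
    out[end - p] = '\0';
    return 1;
}
```

A caller would typically mmap() an anonymous region, pass its address and length to name_anon_region() with a name such as "jit cache", and then scan /proc/self/maps with parse_anon_name() to confirm the tag. Callers should treat a failure of the prctl call as non-fatal, since the feature is optional.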