On Fri, Jul 12, 2013 at 2:49 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Fri, Jul 12, 2013 at 11:40:44AM +0200, Ingo Molnar wrote:
>> * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> > On Fri, Jul 12, 2013 at 11:15:06AM +0200, Ingo Molnar wrote:
>> > >
>> > > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > >
>> > > > We need those files anyway.. The current proposal is that the entire VMA
>> > > > has a single userspace pointer in it. Or rather a 64bit value.
>> > >
>> > > Yes but accessible via /proc/<PID>/mem or so?
>> >
>> > *shudder*.. yes. But you're again opening two files. The only advantage
>> > of this over userspace writing its own files is that the kernel cleans
>> > things up for you.
>>
>> Opening of the files only occurs in the instrumentation case, which is
>> rare. But temporary files would be forced upon the regular usecase when no
>> instrumentation goes on.
>
> Well, Colin didn't describe the intended use, but I can imagine a case where
> its not all that rare. System health monitors might frequently want to update
> this.
>
>> > However from what I understood android runs apps as individual users,
>> > and I think we can do per user tmpfs mounts. So app dies, user exits,
>> > mount goes *poof*.
>>
>> Yes, user-space could be smarter about temporary files.
>>
>> Just like big banks could be less risk happy.
>>
>> Yet the reality is that if left alone both apps and banks mess up, I don't
>> think libertarianism works for policy: we are better off offering a
>> framework that is simple, robust, self-contained, low risk and hard to
>> mess up?
>
> Fair enough; but I still want Colin to tell me why he can't do this in
> userspace. And what all he wants to go do with this information etc.
>
> He's basically not told us much at all.

I covered it a little in the thread on the previous version of the patch,
but I'll try to give more detail (and include it in a patch stack
description if I post another version).

In many userspace applications, and especially in the VM-based
applications that Android uses heavily, there are multiple different
allocators in use. At a minimum there is libc malloc and the stack, and
in many cases there are libc malloc, the stack, direct syscalls to mmap
anonymous memory, and multiple VM heaps (one for small objects, one for
big objects, etc.). Each of these layers usually has its own tools to
inspect its usage: malloc through a debug build, the VM through heap
inspection tools. For direct syscalls there is usually no way to track
them at all.

On Android we heavily use a set of tools that use an extended version of
the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical memory
usage even in cases like fork without exec (which Android uses heavily
to share as many private COW pages as possible between processes),
Kernel Samepage Merging (KSM), and clean zero pages. It produces a
measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between the processes
that share them (PSS).

We need the feature to be efficient enough to be left on at all times,
because app developers and end users can use similar tools exposed
through system reports and bugreports to determine the memory usage of
apps.

If all anonymous memory is indistinguishable, then figuring out the real
physical memory usage of each heap requires either a pagemap walking
tool that can understand the heap debugging of every layer, or for every
layer's heap debugging tools to implement the pagemap walking logic, in
which case it is hard to get a consistent view of memory across the
whole system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that somebody
needs to clean up on crashes. It needs to be readable while the process
is still running, so it has to have some sort of synchronization with
every layer of userspace. Efficiently tracking the ranges requires
reimplementing something like the kernel vma trees, and linking to it
from every layer of userspace. It requires more memory, more syscalls,
more runtime cost, and more complexity to separately track regions that
the kernel is already tracking.

This feature is considered critical enough that Dalvik (Android's VM)
uses ashmem, which is effectively deleted tmpfs files, solely to name
its heaps. I'd like to get rid of as much ashmem use within Android as
possible, with an eye towards deprecating it. ashmem heaps work
reasonably well for a VM, which is likely to want a single contiguous
region of address space that it will manage on its own, but they fall
apart for malloc, which often wants small kernel-allocated address space
regions that may or may not merge with adjacent regions. Blindly using
ashmem/deleted tmpfs files instead of anonymous mmaps in malloc doubled
the number of vmas in our main system process and was worse for the
GLBenchmark process.

As a concrete example of its usefulness (which should not be considered
the extent of its usefulness, it's just what I happened to be looking
at), I was recently tracking down why we were seeing many dirty private
pages that were all zeroes being merged by KSM. Using a mixture of
ashmem naming and an early version of this patch, I could slice the
number of KSM merged pages per process and per heap, which then told me
which heap debugging tools I should use to find who was dirtying large
regions of zeroes.
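
For readers unfamiliar with the USS and PSS figures mentioned above, the
accounting can be illustrated with a minimal sketch. This is not the
Android tooling. The function name uss_pss and the page_owners input are
hypothetical; a real tool would build that table by walking
/proc/<PID>/pagemap and /proc/kpagecount as described in
Documentation/vm/pagemap.txt. The sketch also assumes each process maps
a given page at most once and that pages are 4096 bytes.

```python
from collections import defaultdict


def uss_pss(page_owners, page_size=4096):
    """Compute per-process USS and PSS in bytes.

    page_owners is a hypothetical input: a dict mapping a physical page
    frame number to the set of PIDs that currently map that page.
    """
    uss = defaultdict(int)    # bytes in pages mapped by exactly one process
    pss = defaultdict(float)  # bytes with shared pages split evenly
    for pids in page_owners.values():
        if not pids:
            continue  # unmapped page, nothing to charge
        share = page_size / len(pids)
        for pid in pids:
            pss[pid] += share
            if len(pids) == 1:
                uss[pid] += page_size
    return dict(uss), dict(pss)


if __name__ == "__main__":
    # Two processes: one private page each, plus two pages shared by both.
    owners = {1: {100}, 2: {100, 200}, 3: {100, 200}, 4: {200}}
    uss, pss = uss_pss(owners)
    for pid in sorted(pss):
        print(pid, "USS", uss.get(pid, 0), "PSS", pss[pid])
```

Summing PSS over all processes gives the total physical memory mapped by
userspace, which is why it suits system-wide reports, while USS shows
what would be freed if a single process exited.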