On Fri, Jul 12, 2013 at 1:51 PM, Colin Cross <ccross@xxxxxxxxxxx> wrote: > On Fri, Jul 12, 2013 at 2:49 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: >> On Fri, Jul 12, 2013 at 11:40:44AM +0200, Ingo Molnar wrote: >>> * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: >>> >>> > On Fri, Jul 12, 2013 at 11:15:06AM +0200, Ingo Molnar wrote: >>> > > >>> > > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote: >>> > > >>> > > > We need those files anyway.. The current proposal is that the entire VMA >>> > > > has a single userspace pointer in it. Or rather a 64bit value. >>> > > >>> > > Yes but accessible via /proc/<PID>/mem or so? >>> > >>> > *shudder*.. yes. But you're again opening two files. The only advantage >>> > of this over userspace writing its own files is that the kernel cleans >>> > things up for you. >>> >>> Opening of the files only occurs in the instrumentation case, which is >>> rare. But temporary files would be forced upon the regular usecase when no >>> instrumentation goes on. >> >> Well, Colin didn't describe the intended use, but I can imagine a case where >> its not all that rare. System health monitors might frequently want to update >> this. >> >>> > However from what I understood android runs apps as individual users, >>> > and I think we can do per user tmpfs mounts. So app dies, user exits, >>> > mount goes *poof*. >>> >>> Yes, user-space could be smarter about temporary files. >>> >>> Just like big banks could be less risk happy. >>> >>> Yet the reality is that if left alone both apps and banks mess up, I don't >>> think libertarianism works for policy: we are better off offering a >>> framework that is simple, robust, self-contained, low risk and hard to >>> mess up? >> >> Fair enough; but I still want Colin to tell me why he can't do this in >> userspace. And what all he wants to go do with this information etc. >> >> He's basically not told us much at all. > > I covered it a little in the thread on the previous version of the > patch, but I'll try to give more detail (and include it in a patch > stack description if I post another version). > > In many userspace applications, and especially in VM based > applications like Android uses heavily, there are multiple different > allocators in use. At a minimum there is libc malloc and the stack, > and in many cases there are libc malloc, the stack, direct syscalls to > mmap anonymous memory, and multiple VM heaps (one for small objects, > one for big objects, etc.). Each of these layers usually has its own > tools to inspect its usage; malloc by compiling a debug version, the > VM through heap inspection tools, and for direct syscalls there is > usually no way to track them. > > On Android we heavily use a set of tools that use an extended version > of the logic covered in Documentation/vm/pagemap.txt to walk all pages > mapped in userspace and slice their usage by process, shared (COW) vs. > unique mappings, backing, etc. This can account for real physical > memory usage even in cases like fork without exec (which Android uses > heavily to share as many private COW pages as possible between > processes), Kernel SamePage Merging, and clean zero pages. It > produces a measurement of the pages that only exist in that process > (USS, for unique), and a measurement of the physical memory usage of > that process with the cost of shared pages being evenly split between > processes that share them (PSS). We need the feature to be efficient > enough to be left on at all times because app developers and end users > can use similar tools exposed through system reports and bugreports to > determine the memory usage of apps > > If all anonymous memory is indistinguishable then figuring out the > real physical memory usage of each heap requires either a pagemap > walking tool that can understand the heap debugging of every layer, or > for every layer's heap debugging tools to implement the pagemap > walking logic, in which case it is hard to get a consistent view of > memory across the whole system. > > Tracking the information in userspace leads to all sorts of problems. > It either needs to be stored inside the process, which means every > process has to have an API to export its current heap information upon > request, or it has to be stored externally in a filesystem that > somebody needs to clean up on crashes. It needs to be readable while > the process is still running, so it has to have some sort of > synchronization with every layer of userspace. Efficiently tracking > the ranges requires reimplementing something like the kernel vma > trees, and linking to it from every layer of userspace. It requires > more memory, more syscalls, more runtime cost, and more complexity to > separately track regions that the kernel is already tracking. > > This feature is considered critical enough that Dalvik (Android's VM) > uses ashmem, which is effectively deleted tmpfs files, solely to name > their heaps. I'd like to get rid of as much ashmem use within > Android as possible, with an eye towards deprecating it. ashmem heaps > work reasonably well for a VM, which is likely to want a single > contiguous region of address space that it will manage on its own, but > falls apart for malloc, which often wants small kernel-allocated > address space regions that may or may not merge with adjacent regions. > Blindly using ashmem/deleted tmpfs files instead of anonymous mmaps > in malloc doubled the number of vmas in our main system process and > was worse for the GLBenchmark process. > > As a concrete example of its usefulness (which should not be > considered the extent of its usefulness, it's just what I happened to > be looking at), I was recently tracking down why we were seeing many > dirty private pages that were all zeroes being merged by KSM. Using a > mixture of ashmem naming and an early version of this patch, I could > slice the the number of KSM merged pages per process and per heap, > which then told me which heap debugging tools I should use to find who > was dirtying large regions of zeroes. Peter, any thoughts on this? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>