On Fri, May 24, 2013 at 11:40 AM, Christoph Lameter <cl@xxxxxxxxx> wrote:
> On Fri, 24 May 2013, Peter Zijlstra wrote:
>
>> Patch bc3e53f682 ("mm: distinguish between mlocked and pinned pages")
>> broke RLIMIT_MEMLOCK.
>
> Nope the patch fixed a problem with double accounting.
>
> The problem that we seem to have is to define what mlocked and pinned mean
> and how this relates to RLIMIT_MEMLOCK.
>
> mlocked pages are pages that are movable (not pinned!!!) and that are
> marked in some way by user space actions as mlocked (POSIX semantics).
> They are marked with a special page flag (PG_mlocked).
>
> Pinned pages are pages that have an elevated refcount because the hardware
> needs to use these pages for I/O. The elevated refcount may be temporary
> (then we dont care about this) or for a longer time (such as the memory
> registration of the IB subsystem). That is when we account the memory as
> pinned. The elevated refcount stops page migration and other things from
> trying to move that memory.
>
> Pages can be both pinned and mlocked. Before my patch some pages those two
> issues were conflated since the same counter was used and therefore these
> pages were counted twice. If an RDMA application was running using
> mlockall() and was performing large scale I/O then the counters could show
> extraordinary large numbers and the VM would start to behave erratically.
>
> It is important for the VM to know which pages cannot be evicted but that
> involves many more pages due to dirty pages etc etc.
>
> So far the assumption has been that RLIMIT_MEMLOCK is a limit on the pages
> that userspace has mlocked.
>
> You want the counter to mean something different it seems. What is it?
>
> I think we need to be first clear on what we want to accomplish and what
> these counters actually should count before changing things.

Hm.
If pinned and mlocked are intentionally totally different, why does IB
use RLIMIT_MEMLOCK?
Why doesn't IB use an IB-specific limit, and why does only IB raise the
number of pinned pages while other gup users don't? I can't guess the
IB folks' intent.

And now every piece of IB code has a duplicated RLIMIT_MEMLOCK check,
and at least __ipath_get_user_pages() forgets to check
capable(CAP_IPC_LOCK). That's bad.

> Certainly would appreciate improvements in this area but resurrecting the
> conflation between mlocked and pinned pages is not the way to go.
>
>> This patch proposes to properly fix the problem by introducing
>> VM_PINNED. This also provides the groundwork for a possible mpin()
>> syscall or MADV_PIN -- although these are not included.
>
> Maybe add a new PIN page flag? Pages are not pinned per vma as the patch
> seems to assume.

Generically, you are right. But if VM_PINNED is really only for IB, this
is an acceptable limitation. They can split the vma for their own
purposes.

Anyway, I agree we should clearly understand the semantics of IB pinning,
as well as the userland usage and assumptions.
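
To make the duplicated-check concern concrete, here is a minimal
userspace sketch (not kernel code) of what a single shared limit check
could look like, so that no caller can forget the capability bypass.
Everything in it is an assumption for illustration: struct mm_model and
memlock_charge_allowed() are hypothetical names, the cap_ipc_lock field
merely stands in for capable(CAP_IPC_LOCK), and 4 KiB pages are assumed.

```c
#include <stdbool.h>

/* Assumption: 4 KiB pages. The real kernel uses PAGE_SHIFT. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical model of the state a limit check would read. */
struct mm_model {
    unsigned long locked_pages;   /* pages already charged to the limit */
    unsigned long memlock_limit;  /* RLIMIT_MEMLOCK value, in bytes */
    bool cap_ipc_lock;            /* stands in for capable(CAP_IPC_LOCK) */
};

/*
 * Return true if charging npages more pages is allowed: either the
 * caller holds the capability, or the new total stays within the limit.
 * Written so the addition cannot overflow.
 */
static bool memlock_charge_allowed(const struct mm_model *mm,
                                   unsigned long npages)
{
    unsigned long limit_pages = mm->memlock_limit >> MODEL_PAGE_SHIFT;

    if (mm->cap_ipc_lock)
        return true;
    if (mm->locked_pages > limit_pages)
        return false;
    return npages <= limit_pages - mm->locked_pages;
}
```

With one helper like this, the capability check lives in one place
instead of being repeated (or omitted) in each driver.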