On Fri, 15 Jul 2016, Michal Hocko wrote:

> > If PF_MEMALLOC context is allocating too much memory reserves, then I'd
> > argue that is a problem independent of using mempool_alloc() since
> > mempool_alloc() can evolve directly into a call to the page allocator.
> > How does such a process guarantee that it cannot deplete memory reserves
> > with a simple call to the page allocator? Since nothing in the page
> > allocator is preventing complete depletion of reserves (it simply uses
> > ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be
> > responsible.
>
> Well, the reclaim throttles the allocation request if there are too many
> pages under writeback and that should slow down the allocation rate and
> give the writeback some time to complete. But yes you are right there is
> nothing to prevent from memory depletion and it is really hard to come
> up with something with no fail semantic.
>

If the reclaimer is allocating memory, it can fully deplete memory
reserves with ALLOC_NO_WATERMARKS without any direct reclaim itself and
we're relying on kswapd entirely if nothing else is reclaiming in
parallel (and depleting memory reserves itself in parallel). It's a
difficult problem because memory reserves can be very small and
concurrent PF_MEMALLOC allocation contexts can lead to quick depletion.
I don't think it's a throttling problem itself, it's more scalability.

> I would like separate TIF_MEMDIE as an access to memory reserves from
> oom selection selection semantic. And let me repeat your proposed patch
> has a undesirable side effects so we should think about a way to deal
> with those cases. It might work for your setups but it shouldn't break
> others at the same time. OOM situation is quite unlikely compared to
> simple memory depletion by writing to a swap...
>

I haven't proposed any patch, not sure what the reference is to.
There's two fundamental ways to go about it: (1) ensure mempool_alloc()
can make forward progress (whether that's by way of gfp flags or access
to memory reserves, which may depend on the process context such as
PF_MEMALLOC) or (2) rely on an implementation detail of mempools to
never access memory reserves, although it is shown to not livelock
systems on 4.7 and earlier kernels, and instead rely on users of the
same mempool to return elements to the freelist in all contexts,
including oom contexts.

The mempool implementation itself shouldn't need any oom awareness, that
should be a page allocator issue. If the mempool user can guarantee that
elements will be returned to the freelist in all contexts, we could
relax the restriction that mempool users cannot use __GFP_NOMEMALLOC and
leave it up to them to prevent access to memory reserves but only in
situations where forward progress can be guaranteed. That's a simple
change and doesn't change mempool or page allocator behavior for
everyone, but rather only for those that opt-in. I think this is the way
the dm folks should proceed, but let's not encode any special
restriction on access to memory reserves as an implementation detail to
mempools, specifically for processes that have PF_MEMALLOC set.

> Just to make sure I understand properly:
> Task A                         Task B              Task C
> current->flags = PF_MEMALLOC
> mutex_lock(&foo)               mutex_lock(&foo)    out_of_memory
> mempool_alloc()                                    select_bad__process = Task B
> alloc_pages(__GFP_NOMEMALLOC)
>

Not sure who is grabbing foo first with this, I assume Task A and Task B
is contending. If that's the case, then yes, this is the dm_request()
oom livelock that went unresolved for two hours on our machines and
timed them all out. This is a swapless environment that heavily
oversubscribes the machine, so not everybody's use case, but it needs to
be resolved.
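
To make option (2) above concrete, here is a toy userspace model of a
pool whose forward progress depends only on users returning elements to
the freelist. It is a sketch under stated assumptions, not kernel code:
every toy_* name, the element size, and the backing_fails switch are
made up for illustration, and returning NULL stands in for the point
where a sleeping allocation would wait for a freed element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy userspace model only; none of these names are kernel APIs. */
#define TOY_POOL_MIN  2   /* number of preallocated reserve elements */
#define TOY_ELEM_SIZE 64  /* arbitrary element size for the model */

struct toy_pool {
	void *reserve[TOY_POOL_MIN]; /* preallocated elements (the freelist) */
	size_t nr_free;              /* populated slots in reserve[] */
	bool backing_fails;          /* simulate the backing allocator failing */
};

/* Release whatever is currently sitting on the freelist. */
static void toy_pool_destroy(struct toy_pool *p)
{
	while (p->nr_free > 0)
		free(p->reserve[--p->nr_free]);
}

/* Fill the freelist up front, while memory is still available. */
static bool toy_pool_init(struct toy_pool *p)
{
	p->nr_free = 0;
	p->backing_fails = false;
	while (p->nr_free < TOY_POOL_MIN) {
		void *e = malloc(TOY_ELEM_SIZE);
		if (!e) {
			toy_pool_destroy(p);
			return false;
		}
		p->reserve[p->nr_free++] = e;
	}
	return true;
}

/*
 * Try the backing allocator first and fall back to the freelist. This
 * path never dips into any emergency reserve of the backing allocator.
 * NULL means "would have to wait until somebody frees an element".
 */
static void *toy_pool_alloc(struct toy_pool *p)
{
	void *e = p->backing_fails ? NULL : malloc(TOY_ELEM_SIZE);

	if (e)
		return e;
	if (p->nr_free > 0)
		return p->reserve[--p->nr_free];
	return NULL;
}

/* Refill the freelist first; only surplus elements go back to the system. */
static void toy_pool_free(struct toy_pool *p, void *e)
{
	assert(e != NULL);
	if (p->nr_free < TOY_POOL_MIN)
		p->reserve[p->nr_free++] = e;
	else
		free(e);
}
```

With backing_fails set, two toy_pool_alloc() calls drain the reserve and
a third returns NULL until some holder calls toy_pool_free(). If every
holder is itself blocked, as in the Task A / Task B case quoted above,
nothing ever refills the freelist, which is the livelock in miniature.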
> That would be really unfortunate but it doesn't really differ much from
> other oom deadlocks when the victim is stuck behind an allocating task.

I'm well aware of many of the system oom and memcg oom livelocks from
experience, unfortunately :)