On Wed 20-07-16 14:06:26, David Rientjes wrote: > On Wed, 20 Jul 2016, Michal Hocko wrote: > > > > Any mempool_alloc() user that then takes a contended mutex can do this. > > > An example: > > > > > > taskA taskB taskC > > > ----- ----- ----- > > > mempool_alloc(a) > > > mutex_lock(b) > > > mutex_lock(b) > > > mempool_alloc(a) > > > > > > Imagine the mempool_alloc() done by taskA depleting all free elements so > > > we rely on it to do mempool_free() before any other mempool allocator can > > > be guaranteed. > > > > > > If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory > > > reserves from the page allocator if __GFP_NOMEMALLOC is automatic in > > > mempool_alloc(). This livelocks the page allocator for all processes. > > > > > > taskB in this case need only stall after taking mutex_lock() successfully; > > > that could be because of the oom livelock, it is contended on another > > > mutex held by an allocator, etc. > > > > But that falls down to the deadlock described by Johannes above because > > then the mempool user would _depend_ on an "unguarded page allocation" > > via that particular lock and that is a bug. > > > > It becomes a deadlock because of mempool_alloc(a) forcing > __GFP_NOMEMALLOC, I agree. > > For that not to be the case, it must be required that between > mempool_alloc() and mempool_free() that we take no mutex that may be held > by any other thread on the system, in any context, that is allocating > memory. If that's a caller's bug as you describe it, and only enabled by > mempool_alloc() forcing __GFP_NOMEMALLOC, then please add the relevant > lockdep detection, which would be trivial to add, so we can determine if > any users are unsafe and prevent this issue in the future. I am sorry but I am neither familiar with the lockdep internals nor I have a time to add this support. > The > overwhelming goal here should be to prevent possible problems in the > future especially if an API does not allow you to opt-out of the behavior. The __GFP_NOMEMALLOC enforcement is there since b84a35be0285 ("[PATCH] mempool: NOMEMALLOC and NORETRY") so more than 10 years ago. So I think it is quite reasonable to expect that users are familiar with this fact and handle it properly in the vast majority cases. In fact mempool deadlocks are really rare. [...] > > Or it would get stuck because even page allocator memory reserves got > > depleted. Without any way to throttle there is no guarantee to make > > further progress. In fact this is not a theoretical situation. It has > > been observed with the swap over dm-crypt and there shouldn't be any > > lock dependeces you are describing above there AFAIU. > > > > They should do mempool_alloc(__GFP_NOMEMALLOC), no argument. How that would be any different from any other mempool user which can be invoked from the swap out path - aka any other IO path? > What is the objection to allowing __GFP_NOMEMALLOC from the caller with > clear documentation on how to use it? It can be described to not allow > depletion of memory reserves with the caveat that the caller must ensure > mempool_free() cannot be blocked in lowmem situations. Look, there are $ git grep mempool_alloc | wc -l 304 many users of this API and we do not want to flip the default behavior which is there for more than 10 years. So far you have been arguing about potential deadlocks and haven't shown any particular path which would have a direct or indirect dependency between mempool and normal allocator and it wouldn't be a bug. As the matter of fact the change we are discussing here causes a regression. If you want to change the semantic of mempool allocator then you are absolutely free to do so. In a separate patch which would be discussed with IO people and other users, though. But we _absolutely_ want to fix the regression first and have a simple fix for 4.6 and 4.7 backports. At this moment there are revert and patch 1 on the table. The later one should make your backtrace happy and should be only as a temporal fix until we find out what is actually misbehaving on your systems. If you are not interested to pursue that way I will simply go with the revert. -- Michal Hocko SUSE Labs -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel