On Thu, Aug 15, 2019 at 4:38 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote: > > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > > > > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote: > > > > > > > As the oom reaper is the primary guarantee of the oom handling forward > > > > progress it cannot be blocked on anything that might depend on blockable > > > > memory allocations. These are not really easy to track because they > > > > might be indirect - e.g. notifier blocks on a lock which other context > > > > holds while allocating memory or waiting for a flusher that needs memory > > > > to perform its work. > > > > > > But lockdep *does* track all this and fs_reclaim_acquire() was created > > > to solve exactly this problem. > > > > > > fs_reclaim is a lock and it flows through all the usual lockdep > > > schemes like any other lock. Any time the page allocator wants to do > > > something the would deadlock with reclaim it takes the lock. > > > > > > Failure is expressed by a deadlock cycle in the lockdep map, and > > > lockdep can handle arbitary complexity through layers of locks, work > > > queues, threads, etc. > > > > > > What is missing? > > > > Lockdep doens't seen everything by far. E.g. a wait_event will be > > caught by the annotations here, but not by lockdep. > > Sure, but the wait_event might be OK if its progress isn't contingent > on fs_reclaim, ie triggered from interrupt, so why ban it? For normal notifiers sure (but lockdep won't help you proof that at all). For oom_reaper apparently not, but that's really Michal's thing, I have not much idea about that. > > And since we're talking about mmu notifiers here and gpus/dma engines. > > We have dma_fence_wait, which can wait for any hw/driver in the system > > that takes part in shared/zero-copy buffer processing. Which at least > > on the graphics side is everything. This pulls in enormous amounts of > > deadlock potential that lockdep simply is blind about and will never > > see. > > It seems very risky to entagle a notifier widely like that. That's why I've looked into all possible ways to annotate them with debug infrastructure. > It looks pretty sure that notifiers are fs_reclaim, so at a minimum > that wait_event can't be contingent on anything that is doing > GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep > safe. > > Avoiding an uncertain wait_event under notifiers would seem to be the > only reasonable design here.. You have to wait for the gpu to finnish current processing in invalidate_range_start. Otherwise there's no point to any of this really. So the wait_event/dma_fence_wait are unavoidable really. That's also why I'm throwing in the lockdep annotation on top, and why it would be really nice if we somehow can get the cross-release work landed. But it looks like all the people who could make it happen are busy with smeltdown :-/ -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch