On Fri 09-10-15 14:00:31, Nikolay Borisov wrote:
> Hello mm people,
>
> I want to ask you the following question, which stemmed from analysing
> and chasing this particular deadlock:
> http://permalink.gmane.org/gmane.linux.kernel/2056730

This link doesn't seem to work properly for me. Could you post a
http://lkml.kernel.org/r/$msg_id link please?

> To summarise it:
>
> For simplicity I will use the following nomenclature:
> t1 - kworker/u96:0
> t2 - kworker/u98:39
> t3 - kworker/u98:7
>
> t1 issues drain_all_pages, which generates IPIs; at the same time,
> however,

OK, as per
http://lkml.kernel.org/r/1444318308-27560-1-git-send-email-kernel%40kyup.com
drain_all_pages is called from __alloc_pages_nodemask, called from the
slab allocator. There is no stack leading to the allocation, but then
you are saying

> t2 has already started doing async write of pages
> as part of its normal operation but is blocked upon t1's completion of
> its IPI (generated from drain_all_pages), since they both work on the
> same dm-thin volume.

which I read as: the allocator is holding the same dm_bufio_lock, right?

> At the same time again, t3 is executing
> ext4_finish_bio, which disables interrupts, yet is dependent on t2
> completing its writes.

That would be a bug on its own, because ext4_finish_bio seems to be
called from SoftIRQ context, so it cannot wait for a regular scheduling
context. Whoever is holding that BH_Uptodate_Lock has to be in
(soft)IRQ context.

<found the original thread on linux-mm finally - the threading got
broken on the way>
http://lkml.kernel.org/r/20151013131453.GA1332%40quack.suse.cz

So Jack (CCed) thinks this is a non-atomic update of flags, and that
indeed sounds plausible.

> But since it has disabled interrupts, it won't
> respond to t1's IPI, and at this point a hard lockup occurs.
> This
> happens, since drain_all_pages calls on_each_cpu_mask with the last
> argument equal to "true", meaning "wait until the IPI handler has
> finished", which of course will never happen in the described
> situation.
>
> Based on that I was wondering whether avoiding such a situation might
> merit making the drain_all_pages invocation from
> __alloc_pages_direct_reclaim dependent on a particular GFP flag being
> passed, e.g. GFP_NOPCPDRAIN or something along those lines?

I do not think so. Even if the dependency was real, it would be a clear
deadlock even without drain_all_pages AFAICS.

> Alternatively, would it be possible to make the IPI asynchronous, e.g.
> by calling on_each_cpu_mask with the last argument equal to false?

Strictly speaking, the allocation path doesn't really depend on the
sync behavior. We are just trying to release pages on pcp lists and
retry the allocation. Even if the allocation context was faster than
the other CPUs and failed the request, we would try again without
triggering the OOM killer, because the reclaim has apparently made some
progress. Other callers might be more sensitive. Anyway, this is called
only if the allocator issues a sleeping allocation request, so I think
that waiting here is perfectly acceptable.
-- 
Michal Hocko
SUSE Labs