On Tue 01-12-15 14:25:45, Vladimir Davydov wrote: > On Fri, Nov 27, 2015 at 04:01:33PM +0100, Michal Hocko wrote: > > On Fri 27-11-15 16:40:03, Vladimir Davydov wrote: [...] > > > The problem is not about our driver, in fact. I'm pretty sure one can > > > hit it when using memcg along with loop or dm-crypt for instance. > > > > I am not familiar much with neither but from a quick look the loop > > driver doesn't use mempools tool, it simply relays the data to the > > underlaying file and relies on the underlying fs to write all the pages > > and only prevents from the recursion by clearing GFP_FS and GFP_IO. Then > > I am not really sure how can we guarantee a forward progress. The > > GFP_NOFS allocation might loop inside the allocator endlessly and so > > the writeback wouldn't make any progress. This doesn't seem to be only > > memcg specific. The global case would just replace the deadlock by a > > livelock. I certainly must be missing something here. > > Yeah, I think you're right. If the loop kthread gets stuck in the > reclaimer, it might occur that other processes will isolate, reclaim and > then dirty reclaimed pages, preventing the kthread from running and > cleaning memory, so that we might end up with all memory being under > writeback and no reclaimable memory left for kthread to run and clean > it. Due to dirty limit, this is unlikely to happen though, but I'm not > sure. > > OTOH, with legacy memcg, there is no dirty limit and we can isolate a > lot of pages (SWAP_CLUSTER_MAX = 512 now) per process and wait on page > writeback to complete before releasing them, which sounds bad. And we > can't just remove this wait_on_page_writeback from shrink_page_list, > otherwise OOM might be triggered prematurely. May be, we should putback > rotated pages and release all reclaimed pages before initiating wait? So you want to write for the page while it is on the LRU? I think it would be better to simply not throttle the loopback kthread endlessly in the first place. I was tempted to add an exception and do not wait for writeback on loopback (and alike) devices but this is a dirty hack and I think it is reasonable to rely on the global case to work properly and guarantee a forward progress. And this seems broken currently: FS1 (marks page Writeback) -> request -> BLK -> LOOP -> vfs_iter_write -> FS2 So the writeback bit will be set until vfs_iter_write finishes and that doesn't require data to be written down to the FS2 backing store AFAICS. But what happens if vfs_iter_write gets throttled on the dirty limit or FS2 performs a GFP_NOFS allocation? We will get stuck there for ever without any progress. If loopback FS load is dominant then we are screwed. So I think the global case needs a fix. If it works properly our waiting in the memcg reclaim will be finite as well. On the other hand if all the reclaimable memory is marked as writeback or dirty then we should trigger OOM killer sooner or later so I am not so sure this is such a big deal. I have to play with this though. Anyway I guess we at least want to wait for the writebit killable. I will cook up a patch. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>