This week I tested some IO-congested workloads for a while, and I have
probably reproduced Andreas's issue. So I would like to explain how the
current lumpy reclaim works and why it sucks so much.

1. isolate_lru_pages() currently has the following pfn-neighbor grabbing
logic:

	for (; pfn < end_pfn; pfn++) {
		(snip)
		if (__isolate_lru_page(cursor_page, mode, file) == 0) {
			list_move(&cursor_page->lru, dst);
			mem_cgroup_del_lru(cursor_page);
			nr_taken++;
			nr_lumpy_taken++;
			if (PageDirty(cursor_page))
				nr_lumpy_dirty++;
			scan++;
		} else {
			if (mode == ISOLATE_BOTH &&
			    page_count(cursor_page))
				nr_lumpy_failed++;
		}
	}

Mainly, __isolate_lru_page() can fail for the following reasons:
  (1) the page has already been freed and is in the buddy allocator
  (2) the page is used for a non-user-process purpose
  (3) the page is unevictable (e.g. mlocked)

(2) and (3) have very different characteristics from (1). Lumpy reclaim
means 'contiguous physical memory reclaim'. That is, if we are trying an
order-9 reclaim, reclaiming 512 pages and reclaiming 511 pages are
completely different: the former means lumpy reclaim succeeded, the
latter means it failed. So if (2) or (3) occurs, that pfn range has lost
any possibility of a successful lumpy reclaim. We should then stop the
pfn neighbor search immediately and move on to the next LRU page
(i.e. we should use a 'break' statement instead of 'continue').

2. The synchronous lumpy reclaim condition is insane.
Currently, synchronous lumpy reclaim is invoked under the following
condition:

	if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
	    sc->lumpy_reclaim_mode) {

But "nr_reclaimed < nr_taken" is a pretty stupid test. If the isolated
pages include many dirty pages, pageout() only issues the first 113 IOs
(if the IO queue has >113 requests, bdi_write_congested() returns true
and may_write_to_queue() returns false). So for pages where we have not
even called ->writepage(), calling congestion_wait() and
wait_on_page_writeback() is surely pointless.

3. pageout() is intended to be an asynchronous API, but it does not
behave that way.
pageout() calls ->writepage() with wbc->nonblocking=1. The reason is
that with the default vm.dirty_ratio (i.e. 20) we have 80% clean memory,
so getting stuck on one page is stupid; we should scan as many pages as
possible, as quickly as possible. HOWEVER, the block layer ignores this
argument. If a slow USB memory device is connected to the system,
->writepage() will sleep for a long time, because submit_bio() calls
get_request_wait() unconditionally and it does not give any bonus to
PF_MEMALLOC tasks.

4. Synchronous lumpy reclaim calls clear_active_flags(), but that is
also silly.
page_check_references() now ignores the pte young bit while we are doing
lumpy reclaim. So in almost all cases PageActive() means "swap device is
full". Therefore waiting for IO and retrying pageout() is just silly.

In Andreas's case, congestion_wait() and get_request_wait() are the root
cause. The other issues become problematic for higher-order lumpy
reclaim.

I am preparing some patches now and can probably send them tomorrow.
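
As an illustration of point 1, here is a minimal user-space sketch of the
difference between scanning past a failed page and stopping at it. This is
not kernel code: enum page_state, struct scan_result and scan_block() are
hypothetical names invented for this sketch, and the three page states are
a simplified model of failure cases (1) to (3) above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for this sketch; not kernel definitions. */
enum page_state {
	PAGE_ON_LRU,	/* can be isolated */
	PAGE_FREE,	/* case (1): already free in the buddy allocator */
	PAGE_PINNED	/* cases (2)/(3): kernel-owned or unevictable */
};

struct scan_result {
	size_t nr_taken;	/* pages isolated from the block */
	size_t nr_scanned;	/* pages examined before the scan ended */
	int can_succeed;	/* 1 if the whole block may still become free */
};

/*
 * Scan one aligned block of (1 << order) pages.
 * stop_on_pinned == 0 models a scan that keeps going after a failure,
 * stop_on_pinned == 1 models the proposed 'break'.
 */
static struct scan_result scan_block(const enum page_state *pages,
				     unsigned int order, int stop_on_pinned)
{
	struct scan_result r = { 0, 0, 1 };
	size_t nr_pages, i;

	assert(order < 8 * sizeof(size_t));
	nr_pages = (size_t)1 << order;

	for (i = 0; i < nr_pages; i++) {
		r.nr_scanned++;
		if (pages[i] == PAGE_ON_LRU) {
			r.nr_taken++;
		} else if (pages[i] == PAGE_PINNED) {
			/* One pinned page makes the whole block useless. */
			r.can_succeed = 0;
			if (stop_on_pinned)
				break;
		}
		/* PAGE_FREE: nothing to isolate, block can still succeed. */
	}
	return r;
}
```

For an order-3 block laid out as { LRU, FREE, PINNED, LRU, LRU, LRU, LRU,
LRU }, the non-stopping scan examines all 8 pages and isolates 6 of them
even though the block can never become contiguous free memory. The
stopping scan gives up after 3 pages with only 1 page taken.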