On Mon, Aug 2, 2010 at 8:57 PM, Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>> > So swapping is another major cause of responsiveness lags.
>> >
>> > I just tested the heavy swapping case with the patches to remove
>> > the congestion_wait() and wait_on_page_writeback() stalls on high
>> > order allocations. The patches work as expected. No single stall shows
>> > up with the debug patch posted in http://lkml.org/lkml/2010/8/1/10.
>> >
>> > However there are still stalls on get_request_wait():
>> > - kswapd trying to pageout anonymous pages
>> > - _any_ process in direct reclaim doing pageout()
>>
>> Well, not any.
>>
>> The current check is the following.
>>
>> -----------------------------------------------------------
>> static int may_write_to_queue(struct backing_dev_info *bdi)
>> {
>>         if (current->flags & PF_SWAPWRITE)
>>                 return 1;
>>         if (!bdi_write_congested(bdi))
>>                 return 1;
>>         if (bdi == current->backing_dev_info)
>>                 return 1;
>>         return 0;
>> }
>> -----------------------------------------------------------
>>
>> It means congestion is ignored in the following cases:
>> (1) the task is kswapd
>> (2) the task is flusher thread
>> (3) this reclaim is called from zone reclaim (note: I'm thinking this is bug)
>> (4) this reclaim is called from __generic_file_aio_write()
>>
>> (4) is the root cause of this latency issue. This behavior was introduced
>> by the following commit.
>
> Yes and no.
>
> (1)-(4) are good summaries for regular files. However !bdi_write_congested(bdi)
> is now unconditionally true for the swapper_space, which means any process can
> do swap out to a congested queue and block there.
>
> pageout() has the following comment for the cases:
>
>         /*
>          * If the page is dirty, only perform writeback if that write
>          * will be non-blocking.  To prevent this allocation from being
>          * stalled by pagecache activity.  But note that there may be
>          * stalls if we need to run get_block().  We could test
>          * PagePrivate for that.
>          *
>          * If this process is currently in __generic_file_aio_write() against
>          * this page's queue, we can perform writeback even if that
>          * will block.
>          *
>          * If the page is swapcache, write it back even if that would
>          * block, for some throttling. This happens by accident, because
>          * swap_backing_dev_info is bust: it doesn't reflect the
>          * congestion state of the swapdevs.  Easy to fix, if needed.
>          */
>
>>
>> -------------------------------------------------------------------
>> commit 94bc3c9279ae182ca996d89dc9a56b66b06d5d8f
>> Author: akpm <akpm>
>> Date:   Mon Sep 23 05:17:02 2002 +0000
>>
>>     [PATCH] low-latency page reclaim
>>
>>     Convert the VM to not wait on other people's dirty data.
>>
>>      - If we find a dirty page and its queue is not congested, do some writeback.
>>
>>      - If we find a dirty page and its queue _is_ congested then just
>>        refile the page.
>>
>>      - If we find a PageWriteback page then just refile the page.
>>
>>      - There is additional throttling for write(2) callers.  Within
>>        generic_file_write(), record their backing queue in ->current.
>>        Within page reclaim, if this tasks encounters a page which is dirty
>>        or under writeback on this queue, block on it.  This gives some more
>>        writer throttling and reduces the page refiling frequency.
>>
>>     It's somewhat CPU expensive - under really heavy load we only get a 50%
>>     reclaim rate in pages coming off the tail of the LRU.  This can be
>>     fixed by splitting the inactive list into reclaimable and
>>     non-reclaimable lists.  But the CPU load isn't too bad, and latency is
>>     much, much more important in these situations.
>>
>>     Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
>>     took 35 minutes to compile a kernel.  With this patch, it took three
>>     minutes, 45 seconds.
>>
>>     I haven't done swapcache or MAP_SHARED pages yet.  If there's tons of
>>     dirty swapcache or mmap data around we still stall heavily in page
>>     reclaim.  That's less important.
>>
>>     This patch also has a tweak for swapless machines: don't even bother
>>     bringing anon pages onto the inactive list if there is no swap online.
>>
>>     BKrev: 3d8ea3cekcPCHjOJ65jQtjjrJMyYeA
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a27d273..9118a57 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1755,6 +1755,9 @@ generic_file_write_nolock(struct file *file, const struct iovec *iov,
>>         if (unlikely(pos < 0))
>>                 return -EINVAL;
>>
>> +       /* We can write back this queue in page reclaim */
>> +       current->backing_dev_info = mapping->backing_dev_info;
>> +
>>         pagevec_init(&lru_pvec);
>>
>>         if (unlikely(file->f_error)) {
>> -------------------------------------------------------------------
>>
>> But is this still necessary? Now that we have per-task dirty accounting,
>> the write hog tasks have already got some waiting penalty.
>>
>> As I said, per-task dirty accounting only penalizes tasks that write a
>> lot, but the above penalizes all write(2) users.
>
> Right. We will be transferring file writeback to the flusher threads,
> the whole may_write_to_queue() test can be removed at that time.
> For one thing, conditional page out is disregarding the LRU age.
>
>> >
>> > Since 90% pages are dirty anonymous pages, the chances to stall are high.
>> > kswapd can hardly make smooth progress. The applications end up doing
>> > direct reclaim by themselves, which also ends up stuck in pageout().
>> > They are not explicitly stalled in vmscan code, but implicitly in
>> > get_request_wait() when trying to swap out the dirty pages.
>> >
>> > It sure hurts responsiveness with so many applications stalled on
>> > get_request_wait(). But question is, what can we do otherwise? The
>> > system is running short of memory and cannot keep up freeing enough
>> > memory anyway. So page allocations have to be throttled somewhere..
>> >
>> > But wait.. What if there are only 50% anonymous pages?
>> > In this case applications don't necessarily need to sleep in
>> > get_request_wait(). The memory pressure is not really high. The poor
>> > man's solution is to disable swapping totally, as the bug reporters
>> > find to be helpful..
>> >
>> > One easy fix is to skip swap-out when bdi is congested and priority is
>> > close to DEF_PRIORITY. However it would be unfair to selectively
>> > (largely in random) keep some pages and reclaim the others that
>> > actually have the same age.
>> >
>> > A more complete fix may be to introduce some swap_out LRU list(s).
>> > Pages in it will be swapped out as fast as possible by a dedicated
>> > kernel thread. And pageout() can freely add pages to it until it
>> > grows larger than some threshold, eg. 30% reclaimable memory, at which
>> > point pageout() will stall on the list. The basic idea is to switch
>> > the random get_request_wait() stalls to some more global wise stalls.
>>
>> Yup, I'd prefer this idea. but probably it should retrieve writeback general,
>> not only swapout.
>
> What in my mind is (without any throttling)
>
>         if (PageSwapcache(page)) {
>                 if (bdi_write_congested(bdi))

You mentioned the following:

"However !bdi_write_congested(bdi) is now unconditionally true for the
swapper_space, which means any process can do swap out to a congested
queue and block there."

But you use bdi_write_congested() here.
Which is right?

--
Kind regards,
Minchan Kim
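
To make the question above concrete, here is a minimal userspace sketch (not kernel code). It only mirrors the three-branch shape of the quoted may_write_to_queue(). Every model_* name and both fields of struct model_bdi are hypothetical stand-ins invented for illustration. It assumes, as the quoted pageout() comment says, that the swap backing device never reports congestion, so a check on bdi_write_congested() for swapcache pages would never trigger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct backing_dev_info; not the kernel layout. */
struct model_bdi {
	bool reports_congestion; /* false models swap_backing_dev_info */
	bool queue_is_full;      /* real state of the underlying request queue */
};

/* Hypothetical stand-in for the parts of "current" that the check reads. */
struct model_task {
	bool pf_swapwrite;                        /* current->flags & PF_SWAPWRITE */
	const struct model_bdi *backing_dev_info; /* current->backing_dev_info */
};

/* Congestion is only visible if the bdi reports it at all. */
static bool model_bdi_write_congested(const struct model_bdi *bdi)
{
	return bdi->reports_congestion && bdi->queue_is_full;
}

/* Same three branches as the quoted may_write_to_queue(). */
static int model_may_write_to_queue(const struct model_task *task,
				    const struct model_bdi *bdi)
{
	if (task->pf_swapwrite)
		return 1;
	if (!model_bdi_write_congested(bdi))
		return 1;
	if (bdi == task->backing_dev_info)
		return 1;
	return 0;
}

/* True when the writer is let through but the queue is in fact full,
 * i.e. the case that would end up sleeping in get_request_wait(). */
static bool model_would_block(const struct model_task *task,
			      const struct model_bdi *bdi)
{
	return model_may_write_to_queue(task, bdi) && bdi->queue_is_full;
}
```

In this model an ordinary task in direct reclaim is refused on a congested regular-file queue, but is always let through on the swap device even when its queue is full, which is the stall described earlier in the thread.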