Final result:

I tested the machine with the patch below (Dave's suggestion plus some
parts I modified) and I can't reproduce the problem any more. I tested
for 4 hours; I will queue it on the machine over the weekend for a
long-running test if I don't get a more refined version before leaving
the office today. But as I said in the interim report, the VM's stack
usage is still high.

Anyway, that's a separate issue: we should really put the VM functions
on a diet (e.g., uninlining the slow path from __alloc_pages_nodemask,
the alloc_info idea from Linus, and more).

Looking forward to seeing the blk_plug_start_async approach.
Thanks, Dave!

---
 block/blk-core.c    | 2 +-
 block/blk-mq.c      | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 			trace_block_plug(q);
 		else {
 			if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 		}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			if (list_empty(&plug->mq_list))
 				trace_block_plug(q);
 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 			list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)

 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)

 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-- 
1.9.2

On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > >
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > >
> > > > >     block: let io_schedule() flush the plug inline
> > > > >
> > > > >     Linus correctly observes that the most important dispatch cases
> > > > >     are now done from kblockd, this isn't ideal for latency reasons.
> > > > >     The original reason for switching dispatches out-of-line was to
> > > > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > > > >     flush directly in schedule() be guarded by offload to kblockd,
> > > > >     we should be able to get the best of both worlds.
> > > > >
> > > > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > >     and only use that from the schedule() path.
> > > > >
> > > > >     Signed-off-by: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > > >
> > > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > >
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()....
> > ....
> > > I did below hacky test to apply your idea and the result is overflow again.
> > > So, again it would second stack expansion. Otherwise, we should prevent
> > > swapout in direct reclaim.
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > >  void __sched io_schedule(void)
> > >  {
> > >  	struct rq *rq = raw_rq();
> > > +	struct blk_plug *plug = current->plug;
> > >
> > >  	delayacct_blkio_start();
> > >  	atomic_inc(&rq->nr_iowait);
> > > -	blk_flush_plug(current);
> > > +	if (plug)
> > > +		blk_flush_plug_list(plug, true);
> > > +
> > >  	current->in_iowait = 1;
> > >  	schedule();
> > >  	current->in_iowait = 0;
> >
> > .....
> >
> > >         Depth    Size   Location    (46 entries)
> > >
> > >   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
> > >   1)     7192     296   get_page_from_freelist+0x886/0x920
> > >   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
> > >   3)     6544       8   alloc_pages_current+0x10f/0x1f0
> > >   4)     6536     168   new_slab+0x2c5/0x370
> > >   5)     6368       8   __slab_alloc+0x3a9/0x501
> > >   6)     6360      80   __kmalloc+0x1cb/0x200
> > >   7)     6280     376   vring_add_indirect+0x36/0x200
> > >   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
> > >   9)     5760     288   __virtblk_add_req+0xda/0x1b0
> > >  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
> > >  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > >  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
> > >  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
> > >  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
> > >  15)     5024     112   blk_flush_plug_list+0xc7/0x220
> > >  16)     4912     128   blk_mq_make_request+0x42a/0x600
> > >  17)     4784      48   generic_make_request+0xc0/0x100
> > >  18)     4736     112   submit_bio+0x86/0x160
> > >  19)     4624     160   __swap_writepage+0x198/0x230
> > >  20)     4464      32   swap_writepage+0x42/0x90
> > >  21)     4432     320   shrink_page_list+0x676/0xa80
> > >  22)     4112     208   shrink_inactive_list+0x262/0x4e0
> > >  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
> >
> > The device is supposed to be plugged here in shrink_lruvec().
> >
> > Oh, a plug can only hold 16 individual bios, and then it does a
> > synchronous flush. Hmmm - perhaps that should also defer the flush
> > to the kblockd, because if we are overrunning a plug then we've
> > already surrendered IO dispatch latency....
> >
> > So, in blk_mq_make_request(), can you do:
> >
> > 			if (list_empty(&plug->mq_list))
> > 				trace_block_plug(q);
> > 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> > -				blk_flush_plug_list(plug, false);
> > +				blk_flush_plug_list(plug, true);
> > 				trace_block_plug(q);
> > 			}
> > 			list_add_tail(&rq->queuelist, &plug->mq_list);
> >
> > To see if that defers all the swap IO to kblockd?
> >
>
> Interim report,
>
> I applied below(we need to fix io_schedule_timeout due to mempool_alloc)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bfe16d5af9f9..0c81aacec75b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1585,7 +1585,7 @@ get_rq:
>  			trace_block_plug(q);
>  		else {
>  			if (request_count >= BLK_MAX_REQUEST_COUNT) {
> -				blk_flush_plug_list(plug, false);
> +				blk_flush_plug_list(plug, true);
>  				trace_block_plug(q);
>  			}
>  		}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..ebca9e1f200f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
>
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> @@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
>
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
>
> And result is as follows, It reduce about 800-byte compared to
> my first report but still stack usage seems to be high.
> Really needs diet of VM functions.
>
>         -----    ----   --------
>   0)     6896      16   lookup_address+0x28/0x30
>   1)     6880      16   _lookup_address_cpa.isra.3+0x3b/0x40
>   2)     6864     304   __change_page_attr_set_clr+0xe0/0xb50
>   3)     6560     112   kernel_map_pages+0x6c/0x120
>   4)     6448     256   get_page_from_freelist+0x489/0x920
>   5)     6192     352   __alloc_pages_nodemask+0x5e1/0xb20
>   6)     5840       8   alloc_pages_current+0x10f/0x1f0
>   7)     5832     168   new_slab+0x35d/0x370
>   8)     5664       8   __slab_alloc+0x3a9/0x501
>   9)     5656      80   kmem_cache_alloc+0x1ac/0x1c0
>  10)     5576     296   mempool_alloc_slab+0x15/0x20
>  11)     5280     128   mempool_alloc+0x5e/0x170
>  12)     5152      96   bio_alloc_bioset+0x10b/0x1d0
>  13)     5056      48   get_swap_bio+0x30/0x90
>  14)     5008     160   __swap_writepage+0x150/0x230
>  15)     4848      32   swap_writepage+0x42/0x90
>  16)     4816     320   shrink_page_list+0x676/0xa80
>  17)     4496     208   shrink_inactive_list+0x262/0x4e0
>  18)     4288     304   shrink_lruvec+0x3e1/0x6a0
>  19)     3984      80   shrink_zone+0x3f/0x110
>  20)     3904     128   do_try_to_free_pages+0x156/0x4c0
>  21)     3776     208   try_to_free_pages+0xf7/0x1e0
>  22)     3568     352   __alloc_pages_nodemask+0x783/0xb20
>  23)     3216       8   alloc_pages_current+0x10f/0x1f0
>  24)     3208     168   new_slab+0x2c5/0x370
>  25)     3040       8   __slab_alloc+0x3a9/0x501
>  26)     3032      80   kmem_cache_alloc+0x1ac/0x1c0
>  27)     2952     296   mempool_alloc_slab+0x15/0x20
>  28)     2656     128   mempool_alloc+0x5e/0x170
>  29)     2528      96   bio_alloc_bioset+0x10b/0x1d0
>  30)     2432      48   mpage_alloc+0x38/0xa0
>  31)     2384     208   do_mpage_readpage+0x49b/0x5d0
>  32)     2176     224   mpage_readpages+0xcf/0x120
>  33)     1952      48   ext4_readpages+0x45/0x60
>  34)     1904     224   __do_page_cache_readahead+0x222/0x2d0
>  35)     1680      16   ra_submit+0x21/0x30
>  36)     1664     112   filemap_fault+0x2d7/0x4f0
>  37)     1552     144   __do_fault+0x6d/0x4c0
>  38)     1408     160   handle_mm_fault+0x1a6/0xaf0
>  39)     1248     272   __do_page_fault+0x18a/0x590
>  40)      976      16   do_page_fault+0xc/0x10
>  41)      960     208   page_fault+0x22/0x30
>  42)      752      16   clear_user+0x2e/0x40
>  43)      736      16   padzero+0x2d/0x40
>  44)      720     304   load_elf_binary+0xa47/0x1a40
>  45)      416      48   search_binary_handler+0x9c/0x1a0
>  46)      368     144   do_execve_common.isra.25+0x58d/0x700
>  47)      224      16   do_execve+0x18/0x20
>  48)      208      32   SyS_execve+0x2e/0x40
>  49)      176     176   stub_execve+0x69/0xa0
>
>
>
> --
> Kind regards,
> Minchan Kim
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .

-- 
Kind regards,
Minchan Kim
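
P.S. For anyone skimming the thread, here is a minimal userspace sketch of
what flipping the second argument of blk_flush_plug_list() from false to
true is meant to achieve when a plug overruns its 16-request limit. It is
only a toy model under my own assumptions, not kernel code: every toy_*
name is hypothetical, the "deferred" counter merely stands in for work
handed to kblockd, and nothing here measures stack depth.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for BLK_MAX_REQUEST_COUNT (16 per Dave's description). */
#define TOY_MAX_REQUEST_COUNT 16

/* Hypothetical toy plug: counters only, no real request list. */
struct toy_plug {
	int pending;         /* requests currently held on the plug */
	int inline_dispatch; /* requests dispatched on the submitter's stack */
	int deferred;        /* requests handed to a worker (kblockd stand-in) */
};

/* Toy analogue of blk_flush_plug_list(plug, from_schedule). */
static void toy_flush_plug_list(struct toy_plug *plug, bool from_schedule)
{
	assert(plug != NULL);
	if (from_schedule)
		plug->deferred += plug->pending;        /* worker runs it later */
	else
		plug->inline_dispatch += plug->pending; /* caller runs it now */
	plug->pending = 0;
}

/* Toy analogue of the overrun check in the make_request path. */
static void toy_make_request(struct toy_plug *plug, bool defer_on_overrun)
{
	if (plug->pending >= TOY_MAX_REQUEST_COUNT)
		toy_flush_plug_list(plug, defer_on_overrun);
	plug->pending++;
}

/* Submit nr_requests through a fresh plug and return the final counters. */
static struct toy_plug toy_run(int nr_requests, bool defer_on_overrun)
{
	struct toy_plug plug = {0};

	for (int i = 0; i < nr_requests; i++)
		toy_make_request(&plug, defer_on_overrun);
	return plug;
}
```

With defer_on_overrun set to false, every overrun flush is counted as
dispatched on the submitter's own stack; with true, the same requests are
counted as handed to the worker, which is the behaviour the patch is
aiming for in the swap-out path.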