On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > Hello Dave,
> > 
> > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > 
> > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > Author: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > 
> > > >     block: let io_schedule() flush the plug inline
> > > > 
> > > >     Linus correctly observes that the most important dispatch cases
> > > >     are now done from kblockd, this isn't ideal for latency reasons.
> > > >     The original reason for switching dispatches out-of-line was to
> > > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > > >     flush directly in schedule() be guarded by offload to kblockd,
> > > >     we should be able to get the best of both worlds.
> > > > 
> > > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > >     and only use that from the schedule() path.
> > > > 
> > > >     Signed-off-by: Jens Axboe <jaxboe@xxxxxxxxxxxx>
> > > > 
> > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > 
> > > So, if we make io_schedule() push the plug list off to the kblockd
> > > like is done for schedule()....
> ....
> > I did the hacky test below to apply your idea, and the result is an
> > overflow again. So, again, it would need a second stack expansion.
> > Otherwise, we should prevent swapout in direct reclaim.
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f5c6635b806c..95f169e85dbe 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> >  void __sched io_schedule(void)
> >  {
> >  	struct rq *rq = raw_rq();
> > +	struct blk_plug *plug = current->plug;
> > 
> >  	delayacct_blkio_start();
> >  	atomic_inc(&rq->nr_iowait);
> > -	blk_flush_plug(current);
> > +	if (plug)
> > +		blk_flush_plug_list(plug, true);
> > +
> >  	current->in_iowait = 1;
> >  	schedule();
> >  	current->in_iowait = 0;
> .....
> >         Depth    Size   Location    (46 entries)
> >         -----    ----   --------
> >   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
> >   1)     7192     296   get_page_from_freelist+0x886/0x920
> >   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
> >   3)     6544       8   alloc_pages_current+0x10f/0x1f0
> >   4)     6536     168   new_slab+0x2c5/0x370
> >   5)     6368       8   __slab_alloc+0x3a9/0x501
> >   6)     6360      80   __kmalloc+0x1cb/0x200
> >   7)     6280     376   vring_add_indirect+0x36/0x200
> >   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
> >   9)     5760     288   __virtblk_add_req+0xda/0x1b0
> >  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
> >  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
> >  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
> >  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
> >  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
> >  15)     5024     112   blk_flush_plug_list+0xc7/0x220
> >  16)     4912     128   blk_mq_make_request+0x42a/0x600
> >  17)     4784      48   generic_make_request+0xc0/0x100
> >  18)     4736     112   submit_bio+0x86/0x160
> >  19)     4624     160   __swap_writepage+0x198/0x230
> >  20)     4464      32   swap_writepage+0x42/0x90
> >  21)     4432     320   shrink_page_list+0x676/0xa80
> >  22)     4112     208   shrink_inactive_list+0x262/0x4e0
> >  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
> 
> The device is supposed to be plugged here in shrink_lruvec().
> 
> Oh, a plug can only hold 16 individual bios, and then it does a
> synchronous flush.
> Hmmm - perhaps that should also defer the flush to the kblockd,
> because if we are overrunning a plug then we've already surrendered
> IO dispatch latency....
> 
> So, in blk_mq_make_request(), can you do:
> 
> 	if (list_empty(&plug->mq_list))
> 		trace_block_plug(q);
> 	else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> -		blk_flush_plug_list(plug, false);
> +		blk_flush_plug_list(plug, true);
> 		trace_block_plug(q);
> 	}
> 	list_add_tail(&rq->queuelist, &plug->mq_list);
> 
> To see if that defers all the swap IO to kblockd?

Interim report: I applied the patch below (we also need to fix
io_schedule_timeout because of mempool_alloc):

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 		trace_block_plug(q);
 	else {
 		if (request_count >= BLK_MAX_REQUEST_COUNT) {
-			blk_flush_plug_list(plug, false);
+			blk_flush_plug_list(plug, true);
 			trace_block_plug(q);
 		}
 	}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;

And the result is as follows. It reduces stack usage by about 800 bytes
compared to my first report, but usage still seems too high. The VM
functions really need to go on a diet.
        Depth    Size   Location
        -----    ----   --------
  0)     6896      16   lookup_address+0x28/0x30
  1)     6880      16   _lookup_address_cpa.isra.3+0x3b/0x40
  2)     6864     304   __change_page_attr_set_clr+0xe0/0xb50
  3)     6560     112   kernel_map_pages+0x6c/0x120
  4)     6448     256   get_page_from_freelist+0x489/0x920
  5)     6192     352   __alloc_pages_nodemask+0x5e1/0xb20
  6)     5840       8   alloc_pages_current+0x10f/0x1f0
  7)     5832     168   new_slab+0x35d/0x370
  8)     5664       8   __slab_alloc+0x3a9/0x501
  9)     5656      80   kmem_cache_alloc+0x1ac/0x1c0
 10)     5576     296   mempool_alloc_slab+0x15/0x20
 11)     5280     128   mempool_alloc+0x5e/0x170
 12)     5152      96   bio_alloc_bioset+0x10b/0x1d0
 13)     5056      48   get_swap_bio+0x30/0x90
 14)     5008     160   __swap_writepage+0x150/0x230
 15)     4848      32   swap_writepage+0x42/0x90
 16)     4816     320   shrink_page_list+0x676/0xa80
 17)     4496     208   shrink_inactive_list+0x262/0x4e0
 18)     4288     304   shrink_lruvec+0x3e1/0x6a0
 19)     3984      80   shrink_zone+0x3f/0x110
 20)     3904     128   do_try_to_free_pages+0x156/0x4c0
 21)     3776     208   try_to_free_pages+0xf7/0x1e0
 22)     3568     352   __alloc_pages_nodemask+0x783/0xb20
 23)     3216       8   alloc_pages_current+0x10f/0x1f0
 24)     3208     168   new_slab+0x2c5/0x370
 25)     3040       8   __slab_alloc+0x3a9/0x501
 26)     3032      80   kmem_cache_alloc+0x1ac/0x1c0
 27)     2952     296   mempool_alloc_slab+0x15/0x20
 28)     2656     128   mempool_alloc+0x5e/0x170
 29)     2528      96   bio_alloc_bioset+0x10b/0x1d0
 30)     2432      48   mpage_alloc+0x38/0xa0
 31)     2384     208   do_mpage_readpage+0x49b/0x5d0
 32)     2176     224   mpage_readpages+0xcf/0x120
 33)     1952      48   ext4_readpages+0x45/0x60
 34)     1904     224   __do_page_cache_readahead+0x222/0x2d0
 35)     1680      16   ra_submit+0x21/0x30
 36)     1664     112   filemap_fault+0x2d7/0x4f0
 37)     1552     144   __do_fault+0x6d/0x4c0
 38)     1408     160   handle_mm_fault+0x1a6/0xaf0
 39)     1248     272   __do_page_fault+0x18a/0x590
 40)      976      16   do_page_fault+0xc/0x10
 41)      960     208   page_fault+0x22/0x30
 42)      752      16   clear_user+0x2e/0x40
 43)      736      16   padzero+0x2d/0x40
 44)      720     304   load_elf_binary+0xa47/0x1a40
 45)      416      48   search_binary_handler+0x9c/0x1a0
 46)      368     144   do_execve_common.isra.25+0x58d/0x700
 47)      224      16   do_execve+0x18/0x20
 48)      208      32   SyS_execve+0x2e/0x40
 49)      176     176   stub_execve+0x69/0xa0
-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .