On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >> >
> >> > I'm not seeing why that should be an issue. Sure, there's some CPU
> >> > overhead to context switching, but I don't see that it should be that
> >> > big of a deal.
> >
> > It may well change the dispatch order of enough IOs for it to be
> > significant on an IO bound device.
>
> Hmm. Maybe. We obviously try to order the IO's a bit by inode, and I
> could see the use of a workqueue maybe changing that sufficiently. But
> it still sounds a bit unlikely.
>
> And in fact, I think I have a better explanation.
>
> > In outright performance on my test machine, the difference in
> > files/s is noise. However, the consistency looks to be substantially
> > improved and the context switch rate is now running at under
> > 3,000/sec.
>
> Hmm. I don't recall seeing you mention how many context switches per
> second you had before. What is it down from?

Around 4000-4500/sec, so there's not a huge amount of context switches
going on. It's not like btrfs, where this workload generates over
250,000 context switches/sec....

> However, I think I may have found something more interesting here.
>
> The fact is that a *normal* schedule will trigger that whole
> blk_schedule_flush_plug(), but a cond_sched() or a cond_sched_lock()
> doesn't actually do a normal schedule at all. Those trigger a
> *preemption* schedule.

Ok, makes sense - the plug is not being flushed as we switch away, but
Chris' patch makes it do that.

> So it turns out that as far as I can see, the whole "cond_resched()"
> will not start any IO at all, and it will just be left on the thread
> plug until we schedule back to the thread.
>
> So I don't think this has anything to do with kblockd_workqueue. I
> don't think it even gets to that point.
>
> I may be missing something, but just to humor me, can you test the
> attached patch *without* Chris's patch to do explicit plugging? This
> should make cond_resched() and cond_resched_lock() run the unplugging.

Context switches go back to the 4-4500/sec range. Otherwise behaviour
and performance are indistinguishable from Chris' patch.
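For anyone who wants to see where the two paths diverge, here's a
stripped-down sketch of the relevant entry points, roughly as they look
in kernel/sched/core.c around 4.3. It is not the actual source - the
need_resched() loops, preempt counting and task state checks are all
elided, and the resched test in _cond_resched() is simplified - but it
shows why a blocking schedule flushes the plug while cond_resched()
does not:

/* A normal blocking schedule: plugged IO is submitted before we switch. */
asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;

	/*
	 * sched_submit_work() calls blk_schedule_flush_plug(tsk) when
	 * tsk->plug has pending requests queued on it.
	 */
	sched_submit_work(tsk);
	__schedule();
}

/* cond_resched() ends up here; cond_resched_lock() takes a similar path. */
int __sched _cond_resched(void)
{
	if (need_resched()) {		/* simplified; the real check differs */
		/*
		 * A *preemption* schedule: __schedule() is entered directly,
		 * sched_submit_work() never runs, so whatever is sitting on
		 * the task's plug stays there until the task runs again.
		 */
		preempt_schedule_common();
		return 1;
	}
	return 0;
}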
Cheers,

Dave.

PS: just hit another "did this just get broken in 4.3-rc1" issue - I
can't run blktrace while there's an IO load because:

$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....

[ 641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[ 641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[ 641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 641.429066] 0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[ 641.430585] ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[ 641.432143] ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[ 641.433695] Call Trace:
[ 641.434196] [<ffffffff81785609>] dump_stack+0x4b/0x72
[ 641.435219] [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[ 641.436402] [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[ 641.437684] [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[ 641.438933] [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[ 641.440240] [<ffffffff811424f8>] relay_open+0x68/0x2c0
[ 641.441299] [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[ 641.442492] [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[ 641.443586] [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[ 641.444702] [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[ 641.446419] [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[ 641.447489] [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[ 641.449194] [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[ 641.450491] [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[ 641.451575] [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 641.453231] Mem-Info:
[ 641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0 active_file:8456 inactive_file:2070698 isolated_file:0 unevictable:1011 dirty:374605 writeback:1176 unstable:0 slab_reclaimable:1514989 slab_unreclaimable:233692 mapped:9332 shmem:2250 pagetables:1891 bounce:0 free:24305 free_pcp:1448 free_cma:0
[ 641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 641.470017] lowmem_reserve[]: 0 2960 16005 16005
[ 641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.479677] lowmem_reserve[]: 0 0 13044 13044
[ 641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 641.492392] lowmem_reserve[]: 0 0 0 0
[ 641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[ 641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[ 641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[ 641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 641.503019] 2082056 total pagecache pages
[ 641.503759] 0 pages in swap cache
[ 641.504403] Swap cache stats: add 0, delete 0, find 0/0
[ 641.505374] Free swap = 497976kB
[ 641.506002] Total swap = 497976kB
[ 641.506622] 4194173 pages RAM
[ 641.507171] 0 pages HighMem/MovableOnly
[ 641.507878] 92503 pages reserved

This is from:

(gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577                     return NULL;
578             if (subbuf_size > UINT_MAX / n_subbufs)
579                     return NULL;
580
581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582             if (!chan)
583                     return NULL;
584
585             chan->version = RELAYFS_CHANNEL_VERSION;
586             chan->n_subbufs = n_subbufs;

and struct rchan has a member

	struct rchan_buf *buf[NR_CPUS];

and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation that
fails here....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
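The order-5 requirement falls straight out of the struct size
arithmetic. Here's a stand-alone sanity check of that reasoning,
assuming 64-bit pointers, 4k pages and power-of-two kmalloc buckets;
the size of the non-array fields in struct rchan is only a rough guess:

#include <stdio.h>

/*
 * Back-of-the-envelope check of the relay_open() allocation: with
 * CONFIG_NR_CPUS=8192 the buf[NR_CPUS] pointer array alone is 64k, and
 * the remaining rchan fields push the kzalloc() size past the 64k
 * boundary, so it lands in the 128k bucket - an order 5 allocation
 * with 4k pages.
 */
#define NR_CPUS		8192
#define PAGE_SIZE	4096

int main(void)
{
	size_t buf_array = NR_CPUS * sizeof(void *);	/* 8192 * 8 = 65536 */
	size_t other_fields = 400;	/* rough guess at the rest of struct rchan */
	size_t total = buf_array + other_fields;

	/* kmalloc rounds up to the next power-of-two sized cache */
	size_t bucket = PAGE_SIZE;
	while (bucket < total)
		bucket <<= 1;

	int order = 0;
	for (size_t pages = bucket / PAGE_SIZE; pages > 1; pages >>= 1)
		order++;

	printf("~%zu bytes -> %zu byte kmalloc bucket -> order %d\n",
	       total, bucket, order);
	return 0;
}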