Hugh Dickins <hughd@xxxxxxxxxx> writes:

> On Wed, 4 Nov 2015, Huang, Ying wrote:
>> Hugh Dickins <hughd@xxxxxxxxxx> writes:
>>
>> > LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
>> > blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
>> > benchmark.
>> >
>> > creat-clo does just what you'd expect from the name, and creat's O_TRUNC
>> > on a 0-length file does indeed incur more overhead now that shmem_setattr()
>> > tests "0 <= 0" instead of "0 < 0".
>> >
>> > I'm not sure how much we care, but I think it would not be too VW-like
>> > to add in a check for whether any pages (or swap) are allocated: if none
>> > are allocated, there's none to remove from the radix_tree.  At first I
>> > thought that check would be good enough for the unmaps too, but no: we
>> > should not skip the unlikely case of unmapping pages beyond the new EOF,
>> > which were COWed from holes which have now been reclaimed, leaving none.
>> >
>> > This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere,
>> > and running a debug config before and after: I hope those account for
>> > the lesser speedup.
>> >
>> > And probably someone has a benchmark where a thousand threads keep on
>> > stat'ing the same file repeatedly: forestall that report by adjusting
>> > v4.3 commit 44a30220bc0a ("shmem: recalculate file inode when fstat")
>> > not to take the spinlock in shmem_getattr() when there's no work to do.
>> >
>> > Reported-by: Ying Huang <ying.huang@xxxxxxxxxxxxxxx>
>> > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
>>
>> Hi, Hugh,
>>
>> Thanks a lot for your support!  The test on LKP shows that this patch
>> restores a big part of the regression!  In the following list,
>>
>> c435a390574d012f8d30074135d8fcc6f480b484 is the parent commit,
>> afa2db2fb6f15f860069de94a1257db57589fe95 is the first bad commit (the one
>> with the performance regression), and
>> 43819159da2b77fedcf7562134d6003dccd6a068 is the fixing patch.
>
> Hi Ying,
>
> Thank you for reporting, and for trying out the patch (which is now
> in Linus's tree as commit d0424c429f8e0555a337d71e0a13f2289c636ec9).
>
> But I'm disappointed by the result: do I understand correctly,
> that afa2db2fb6f1 made a -12.5% change, but the fix is still -5.6%
> from your parent comparison point?

Yes.

> If we value that microbenchmark
> at all (debatable), I'd say that's not good enough.

I think that is a good improvement.

> It does match my own rough measurement, but I'd been hoping
> for better when done in a more controlled environment; and I cannot
> explain why "truncate prealloc blocks past i_size" creat-clo performance
> would not be fully corrected by "avoid a little creat and stat slowdown"
> (unless either patch adds subtle icache or dcache displacements).
>
> I'm not certain of how you performed the comparison.  Was the
> c435a390574d tree measured, then patch afa2db2fb6f1 applied on top
> of that and measured, then patch 43819159da2b applied on top of that
> and measured?  Or were there other intervening changes, which could
> easily add their own interference?

c435a390574d is the direct parent of afa2db2fb6f1 in its original git tree.
43819159da2b is your patch applied on top of v4.3-rc7.
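To make the fix concrete, here is a sketch of its shape, reconstructed
from the description above rather than copied from the commit (see
d0424c429f8e in Linus's tree for the actual code); the helpers and
fields it uses, such as shmem_truncate_range() and info->alloced, are
real mm/shmem.c identifiers, but the exact lines should be checked
against the commit.  In shmem_setattr(), on a shrinking truncate, the
radix_tree walk is skipped when nothing is allocated, but the unmaps
are not:

	/* newsize <= oldsize: covers creat's O_TRUNC on an already 0-length file */
	loff_t holebegin = round_up(newsize, PAGE_SIZE);

	/*
	 * Always unmap pages beyond the new EOF: they may have been
	 * COWed from holes which have since been reclaimed, so the
	 * allocation check below must not be allowed to skip this.
	 */
	if (oldsize > holebegin)
		unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);

	/* Only walk the radix_tree if any pages or swap are allocated. */
	if (info->alloced)
		shmem_truncate_range(inode, newsize, (loff_t)-1);

	/* Unmap again, to catch pages COWed while we were truncating. */
	if (oldsize > holebegin)
		unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);

And in shmem_getattr(), the spinlock is taken only when the cached
counts show there is actually something to recalculate:

	if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
		spin_lock(&info->lock);
		shmem_recalc_inode(inode);
		spin_unlock(&info->lock);
	}
	generic_fillattr(inode, stat);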
The comparison of 43819159da2b with v4.3-rc7 is as follows:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s

commit:
  32b88194f71d6ae7768a29f87fbba454728273ee
  43819159da2b77fedcf7562134d6003dccd6a068

32b88194f71d6ae7 43819159da2b77fedcf7562134
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
    475224 ±  1%     +11.9%     531968 ±  1%  aim9.creat-clo.ops_per_sec
  10469094 ±201%     -52.3%    4998529 ±130%  latency_stats.avg.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
  18852332 ±223%     -73.5%    4998529 ±130%  latency_stats.max.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
  21758590 ±199%     -77.0%    4998529 ±130%  latency_stats.sum.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
   4817724 ±  0%      +9.6%    5280303 ±  1%  proc-vmstat.numa_hit
   4812582 ±  0%      +9.7%    5280287 ±  1%  proc-vmstat.numa_local
   8499767 ±  4%     +14.2%    9707953 ±  4%  proc-vmstat.pgalloc_normal
   8984075 ±  0%     +10.4%    9919044 ±  1%  proc-vmstat.pgfree
      9.22 ±  8%     +27.4%      11.75 ±  9%  sched_debug.cfs_rq[0]:/.nr_spread_over
      2667 ± 63%     +90.0%       5068 ± 37%  sched_debug.cfs_rq[20]:/.min_vruntime
    152513 ±272%     -98.5%       2306 ± 48%  sched_debug.cfs_rq[21]:/.min_vruntime
    477.36 ± 60%    +128.6%       1091 ± 60%  sched_debug.cfs_rq[27]:/.exec_clock
      4.00 ±112%    +418.8%      20.75 ± 67%  sched_debug.cfs_rq[28]:/.util_avg
      1212 ± 80%    +195.0%       3577 ± 48%  sched_debug.cfs_rq[29]:/.exec_clock
      8119 ± 53%     -60.4%       3217 ± 26%  sched_debug.cfs_rq[2]:/.min_vruntime
    584.80 ± 65%     -60.0%     234.06 ± 13%  sched_debug.cfs_rq[30]:/.exec_clock
      4245 ± 27%     -42.8%       2429 ± 24%  sched_debug.cfs_rq[30]:/.min_vruntime
      0.00 ±  0%      +Inf%       2.25 ± 72%  sched_debug.cfs_rq[44]:/.util_avg
      1967 ± 39%     +72.0%       3384 ± 15%  sched_debug.cfs_rq[61]:/.min_vruntime
      1863 ± 43%     +99.2%       3710 ± 33%  sched_debug.cfs_rq[72]:/.min_vruntime
      0.78 ±336%    -678.6%      -4.50 ±-33%  sched_debug.cpu#12.nr_uninterruptible
     10686 ± 49%     +77.8%      19002 ± 34%  sched_debug.cpu#15.nr_switches
      5256 ± 50%     +79.0%       9410 ± 34%  sched_debug.cpu#15.sched_goidle
     -2.00 ±-139%    -225.0%       2.50 ± 44%  sched_debug.cpu#21.nr_uninterruptible
     -1.78 ±-105%    -156.2%       1.00 ±141%  sched_debug.cpu#23.nr_uninterruptible
     45017 ±132%     -76.1%      10741 ± 30%  sched_debug.cpu#24.nr_load_updates
      2216 ± 14%     +73.3%       3839 ± 63%  sched_debug.cpu#35.nr_switches
      2223 ± 14%     +73.0%       3845 ± 63%  sched_debug.cpu#35.sched_count
      1030 ± 13%     +79.1%       1845 ± 66%  sched_debug.cpu#35.sched_goidle
      2.00 ± 40%     +37.5%       2.75 ± 82%  sched_debug.cpu#46.nr_uninterruptible
    907.11 ± 67%    +403.7%       4569 ± 75%  sched_debug.cpu#59.ttwu_count
     -4.56 ±-41%     -94.5%      -0.25 ±-714%  sched_debug.cpu#64.nr_uninterruptible

So your patch improved performance by 11.9% over its base, v4.3-rc7.  I
think the other differences are caused by other changes.  Sorry for the
confusion.
Best Regards,
Huang, Ying

> Hugh
>
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
>>   gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s
>>
>> commit:
>>   c435a390574d012f8d30074135d8fcc6f480b484
>>   afa2db2fb6f15f860069de94a1257db57589fe95
>>   43819159da2b77fedcf7562134d6003dccd6a068
>>
>> c435a390574d012f afa2db2fb6f15f860069de94a1 43819159da2b77fedcf7562134
>> ---------------- -------------------------- --------------------------
>>          %stddev     %change         %stddev     %change         %stddev
>>              \          |                \          |                \
>>     563556 ±  1%     -12.5%     493033 ±  5%      -5.6%     531968 ±  1%  aim9.creat-clo.ops_per_sec
>>      11836 ±  7%     +11.4%      13184 ±  7%     +15.0%      13608 ±  5%  numa-meminfo.node1.SReclaimable
>>   10121526 ±  3%     -12.1%    8897097 ±  5%      -4.1%    9707953 ±  4%  proc-vmstat.pgalloc_normal
>>       9.34 ±  4%     -11.4%       8.28 ±  3%      -4.8%       8.88 ±  2%  time.user_time
>>       3480 ±  3%      -2.5%       3395 ±  1%     -28.5%       2488 ±  3%  vmstat.system.cs
>>     203275 ± 17%      -6.8%     189453 ±  5%     -34.4%     133352 ± 11%  cpuidle.C1-NHM.usage
>>    8081280 ±129%     -93.3%     538377 ± 97%     +31.5%   10625496 ±106%  cpuidle.C1E-NHM.time
>>       3144 ± 58%    +619.0%      22606 ± 56%    +903.9%      31563 ±  0%  numa-vmstat.node0.numa_other
>>       2958 ±  7%     +11.4%       3295 ±  7%     +15.0%       3401 ±  5%  numa-vmstat.node1.nr_slab_reclaimable
>>      45074 ±  5%     -43.4%      25494 ± 57%     -68.7%      14105 ±  2%  numa-vmstat.node2.numa_other
>>      56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.active_objs
>>       1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.active_slabs
>>      56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.num_objs
>>       1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.num_slabs
>>       1079 ±  5%     -10.8%     962.00 ± 10%    -100.0%       0.00 ± -1%  slabinfo.blkdev_ioc.active_objs
>>       1079 ±  5%     -10.8%     962.00 ± 10%    -100.0%       0.00 ± -1%  slabinfo.blkdev_ioc.num_objs
>>     110.67 ± 39%     +74.4%     193.00 ± 46%    +317.5%     462.00 ±  8%  slabinfo.blkdev_queue.active_objs
>>     189.33 ± 23%     +43.7%     272.00 ± 33%    +151.4%     476.00 ± 10%  slabinfo.blkdev_queue.num_objs
>>       1129 ± 10%      -1.9%       1107 ±  7%     +20.8%       1364 ±  6%  slabinfo.blkdev_requests.active_objs
>>       1129 ± 10%      -1.9%       1107 ±  7%     +20.8%       1364 ±  6%  slabinfo.blkdev_requests.num_objs
>>       1058 ±  3%     -10.3%     949.00 ±  9%    -100.0%       0.00 ± -1%  slabinfo.file_lock_ctx.active_objs
>>       1058 ±  3%     -10.3%     949.00 ±  9%    -100.0%       0.00 ± -1%  slabinfo.file_lock_ctx.num_objs
>>       4060 ±  1%      -2.1%       3973 ±  1%     -10.5%       3632 ±  1%  slabinfo.files_cache.active_objs
>>       4060 ±  1%      -2.1%       3973 ±  1%     -10.5%       3632 ±  1%  slabinfo.files_cache.num_objs
>>      10001 ±  0%      -0.3%       9973 ±  0%     -61.1%       3888 ±  0%  slabinfo.ftrace_event_field.active_objs
>>      10001 ±  0%      -0.3%       9973 ±  0%     -61.1%       3888 ±  0%  slabinfo.ftrace_event_field.num_objs
>>       1832 ±  0%      +0.4%       1840 ±  0%    -100.0%       0.00 ± -1%  slabinfo.ftrace_event_file.active_objs
>>       1832 ±  0%      +0.4%       1840 ±  0%    -100.0%       0.00 ± -1%  slabinfo.ftrace_event_file.num_objs
>>       1491 ±  5%      -2.3%       1456 ±  6%     +12.0%       1669 ±  4%  slabinfo.mnt_cache.active_objs
>>       1491 ±  5%      -2.3%       1456 ±  6%     +12.0%       1669 ±  4%  slabinfo.mnt_cache.num_objs
>>     126.33 ± 19%     +10.2%     139.17 ±  9%    -100.0%       0.00 ± -1%  slabinfo.nfs_commit_data.active_objs
>>     126.33 ± 19%     +10.2%     139.17 ±  9%    -100.0%       0.00 ± -1%  slabinfo.nfs_commit_data.num_objs
>>      97.17 ± 20%      -9.1%      88.33 ± 28%    -100.0%       0.00 ± -1%  slabinfo.user_namespace.active_objs
>>      97.17 ± 20%      -9.1%      88.33 ± 28%    -100.0%       0.00 ± -1%  slabinfo.user_namespace.num_objs
>>
>> Best Regards,
>> Huang, Ying