Re: [PATCH] tmpfs: avoid a little creat and stat slowdown

Hugh Dickins <hughd@xxxxxxxxxx> writes:

> On Wed, 4 Nov 2015, Huang, Ying wrote:
>> Hugh Dickins <hughd@xxxxxxxxxx> writes:
>> 
>> > LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
>> > blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
>> > benchmark.
>> >
>> > creat-clo does just what you'd expect from the name, and creat's O_TRUNC
>> > on 0-length file does indeed get into more overhead now shmem_setattr()
>> > tests "0 <= 0" instead of "0 < 0".
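
For context: the benchmark's hot loop is essentially repeated creat()
and close() of one file, so after the first pass every creat() is an
O_TRUNC on a 0-length file.  A rough user-space approximation (not the
actual AIM9 source; /dev/shm is assumed to be tmpfs-mounted, as on
most distributions):

/*
 * Rough approximation of the AIM9 creat-clo hot loop.  The file
 * persists between iterations, so each creat() after the first
 * truncates an existing 0-length file.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	for (long i = 0; i < 1000000; i++) {
		int fd = creat("/dev/shm/creat-clo.tmp", 0600);

		if (fd < 0)
			return 1;
		close(fd);
	}
	unlink("/dev/shm/creat-clo.tmp");
	return 0;
}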
>> >
>> > I'm not sure how much we care, but I think it would not be too VW-like
>> > to add in a check for whether any pages (or swap) are allocated: if none
>> > are allocated, there's none to remove from the radix_tree.  At first I
>> > thought that check would be good enough for the unmaps too, but no: we
>> > should not skip the unlikely case of unmapping pages beyond the new EOF,
>> > which were COWed from holes which have now been reclaimed, leaving none.
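
To make that concrete, here is a minimal user-space model of the check
being described (a sketch of the idea only, not the actual patch; the
'alloced' field name is borrowed from struct shmem_inode_info):

#include <stdio.h>

struct shmem_info {
	long alloced;	/* pages (and swap) allocated to the file */
};

static void setattr_truncate(struct shmem_info *info,
			     long oldsize, long newsize)
{
	if (newsize <= oldsize) {	/* afa2db2fb6f1: was '<' */
		/*
		 * Unmapping beyond the new EOF is NOT skipped: COWed
		 * pages may sit past EOF even over reclaimed holes.
		 */
		printf("unmap beyond offset %ld\n", newsize);

		/*
		 * The fix: with nothing allocated there is nothing to
		 * remove from the radix_tree, so skip the truncation.
		 */
		if (info->alloced)
			printf("shmem_truncate_range(%ld, EOF)\n",
			       newsize);
	}
}

int main(void)
{
	struct shmem_info empty = { .alloced = 0 };

	/* creat's O_TRUNC on a 0-length file: newsize == oldsize == 0 */
	setattr_truncate(&empty, 0, 0);
	return 0;
}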
>> >
>> > This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere,
>> > and running a debug config before and after: I hope those account for
>> > the lesser speedup.
>> >
>> > And probably someone has a benchmark where a thousand threads keep on
>> > stat'ing the same file repeatedly: forestall that report by adjusting
>> > v4.3 commit 44a30220bc0a ("shmem: recalculate file inode when fstat")
>> > not to take the spinlock in shmem_getattr() when there's no work to do.
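
The stat side, again as a simplified user-space model: take the lock
to reconcile the counters only when a cheap unlocked test says they
may have drifted.  The field names mirror the mm/shmem.c counters and
a mutex stands in for the inode spinlock; this is a sketch, not the
actual commit:

#include <pthread.h>

struct shmem_info {
	pthread_mutex_t lock;	/* stand-in for the inode spinlock */
	long alloced;		/* pages accounted to the inode */
	long swapped;		/* of those, currently out on swap */
	long nrpages;		/* stand-in for mapping->nrpages */
};

static void getattr_recalc(struct shmem_info *info)
{
	/* Repeated stat with no activity: no work, so no lock taken. */
	if (info->alloced - info->swapped != info->nrpages) {
		pthread_mutex_lock(&info->lock);
		/* shmem_recalc_inode() would reconcile counters here */
		info->alloced = info->nrpages + info->swapped;
		pthread_mutex_unlock(&info->lock);
	}
}

int main(void)
{
	struct shmem_info info = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.alloced = 3, .swapped = 1, .nrpages = 2,
	};

	getattr_recalc(&info);	/* counters consistent: lock skipped */
	return 0;
}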
>> >
>> > Reported-by: Ying Huang <ying.huang@xxxxxxxxxxxxxxx>
>> > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
>> 
>> Hi, Hugh,
>> 
>> Thanks a lot for your support!  The test on LKP shows that this patch
>> restores a big part of the regression!  In the following list,
>> 
>> c435a390574d012f8d30074135d8fcc6f480b484: the parent commit
>> afa2db2fb6f15f860069de94a1257db57589fe95: the first bad commit, which
>> introduced the performance regression
>> 43819159da2b77fedcf7562134d6003dccd6a068: the fixing patch
>
> Hi Ying,
>
> Thank you, for reporting, and for trying out the patch (which is now
> in Linus's tree as commit d0424c429f8e0555a337d71e0a13f2289c636ec9).
>
> But I'm disappointed by the result: do I understand correctly,
> that afa2db2fb6f1 made a -12.5% change, but the fix still shows -5.6%
> from your parent comparison point?

Yes.

> If we value that microbenchmark
> at all (debatable), I'd say that's not good enough.

I think that is a good improvement.

> It does match with my own rough measurement, but I'd been hoping
> for better when done in a more controlled environment; and I cannot
> explain why "truncate prealloc blocks past i_size" creat-clo performance
> would not be fully corrected by "avoid a little creat and stat slowdown"
> (unless either patch adds subtle icache or dcache displacements).
>
> I'm not certain of how you performed the comparison.  Was the
> c435a390574d tree measured, then patch afa2db2fb6f1 applied on top
> of that and measured, then patch 43819159da2b applied on top of that
> and measured?  Or were there other intervening changes, which could
> easily add their own interference?

c435a390574d is the direct parent of afa2db2fb6f1 in the original git
tree.  43819159da2b is your patch applied on top of v4.3-rc7.  The
comparison of 43819159da2b with v4.3-rc7 is as follows:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s

commit: 
  32b88194f71d6ae7768a29f87fbba454728273ee
  43819159da2b77fedcf7562134d6003dccd6a068

32b88194f71d6ae7 43819159da2b77fedcf7562134 
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    475224 ±  1%     +11.9%     531968 ±  1%  aim9.creat-clo.ops_per_sec
  10469094 ±201%     -52.3%    4998529 ±130%  latency_stats.avg.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
  18852332 ±223%     -73.5%    4998529 ±130%  latency_stats.max.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
  21758590 ±199%     -77.0%    4998529 ±130%  latency_stats.sum.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
   4817724 ±  0%      +9.6%    5280303 ±  1%  proc-vmstat.numa_hit
   4812582 ±  0%      +9.7%    5280287 ±  1%  proc-vmstat.numa_local
   8499767 ±  4%     +14.2%    9707953 ±  4%  proc-vmstat.pgalloc_normal
   8984075 ±  0%     +10.4%    9919044 ±  1%  proc-vmstat.pgfree
      9.22 ±  8%     +27.4%      11.75 ±  9%  sched_debug.cfs_rq[0]:/.nr_spread_over
      2667 ± 63%     +90.0%       5068 ± 37%  sched_debug.cfs_rq[20]:/.min_vruntime
    152513 ±272%     -98.5%       2306 ± 48%  sched_debug.cfs_rq[21]:/.min_vruntime
    477.36 ± 60%    +128.6%       1091 ± 60%  sched_debug.cfs_rq[27]:/.exec_clock
      4.00 ±112%    +418.8%      20.75 ± 67%  sched_debug.cfs_rq[28]:/.util_avg
      1212 ± 80%    +195.0%       3577 ± 48%  sched_debug.cfs_rq[29]:/.exec_clock
      8119 ± 53%     -60.4%       3217 ± 26%  sched_debug.cfs_rq[2]:/.min_vruntime
    584.80 ± 65%     -60.0%     234.06 ± 13%  sched_debug.cfs_rq[30]:/.exec_clock
      4245 ± 27%     -42.8%       2429 ± 24%  sched_debug.cfs_rq[30]:/.min_vruntime
      0.00 ±  0%      +Inf%       2.25 ± 72%  sched_debug.cfs_rq[44]:/.util_avg
      1967 ± 39%     +72.0%       3384 ± 15%  sched_debug.cfs_rq[61]:/.min_vruntime
      1863 ± 43%     +99.2%       3710 ± 33%  sched_debug.cfs_rq[72]:/.min_vruntime
      0.78 ±336%    -678.6%      -4.50 ±-33%  sched_debug.cpu#12.nr_uninterruptible
     10686 ± 49%     +77.8%      19002 ± 34%  sched_debug.cpu#15.nr_switches
      5256 ± 50%     +79.0%       9410 ± 34%  sched_debug.cpu#15.sched_goidle
     -2.00 ±-139%    -225.0%       2.50 ± 44%  sched_debug.cpu#21.nr_uninterruptible
     -1.78 ±-105%    -156.2%       1.00 ±141%  sched_debug.cpu#23.nr_uninterruptible
     45017 ±132%     -76.1%      10741 ± 30%  sched_debug.cpu#24.nr_load_updates
      2216 ± 14%     +73.3%       3839 ± 63%  sched_debug.cpu#35.nr_switches
      2223 ± 14%     +73.0%       3845 ± 63%  sched_debug.cpu#35.sched_count
      1030 ± 13%     +79.1%       1845 ± 66%  sched_debug.cpu#35.sched_goidle
      2.00 ± 40%     +37.5%       2.75 ± 82%  sched_debug.cpu#46.nr_uninterruptible
    907.11 ± 67%    +403.7%       4569 ± 75%  sched_debug.cpu#59.ttwu_count
     -4.56 ±-41%     -94.5%      -0.25 ±-714%  sched_debug.cpu#64.nr_uninterruptible

So your patch improved the benchmark by 11.9% over its base, v4.3-rc7.
I think the other differences are caused by other changes.  Sorry for
the confusion.

Best Regards,
Huang, Ying

> Hugh
>
>> 
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
>>   gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s
>> 
>> commit: 
>>   c435a390574d012f8d30074135d8fcc6f480b484
>>   afa2db2fb6f15f860069de94a1257db57589fe95
>>   43819159da2b77fedcf7562134d6003dccd6a068
>> 
>> c435a390574d012f afa2db2fb6f15f860069de94a1 43819159da2b77fedcf7562134 
>> ---------------- -------------------------- -------------------------- 
>>          %stddev     %change         %stddev     %change         %stddev
>>              \          |                \          |                \  
>>     563556 ±  1%     -12.5%     493033 ±  5%      -5.6%     531968 ±  1%  aim9.creat-clo.ops_per_sec
>>      11836 ±  7%     +11.4%      13184 ±  7%     +15.0%      13608 ±  5%  numa-meminfo.node1.SReclaimable
>>   10121526 ±  3%     -12.1%    8897097 ±  5%      -4.1%    9707953 ±  4%  proc-vmstat.pgalloc_normal
>>       9.34 ±  4%     -11.4%       8.28 ±  3%      -4.8%       8.88 ±  2%  time.user_time
>>       3480 ±  3%      -2.5%       3395 ±  1%     -28.5%       2488 ±  3%  vmstat.system.cs
>>     203275 ± 17%      -6.8%     189453 ±  5%     -34.4%     133352 ± 11%  cpuidle.C1-NHM.usage
>>    8081280 ±129%     -93.3%     538377 ± 97%     +31.5%   10625496 ±106%  cpuidle.C1E-NHM.time
>>       3144 ± 58%    +619.0%      22606 ± 56%    +903.9%      31563 ±  0%  numa-vmstat.node0.numa_other
>>       2958 ±  7%     +11.4%       3295 ±  7%     +15.0%       3401 ±  5%  numa-vmstat.node1.nr_slab_reclaimable
>>      45074 ±  5%     -43.4%      25494 ± 57%     -68.7%      14105 ±  2%  numa-vmstat.node2.numa_other
>>      56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.active_objs
>>       1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.active_slabs
>>      56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.num_objs
>>       1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.num_slabs
>>       1079 ±  5%     -10.8%     962.00 ± 10%    -100.0%       0.00 ± -1%  slabinfo.blkdev_ioc.active_objs
>>       1079 ±  5%     -10.8%     962.00 ± 10%    -100.0%       0.00 ± -1%  slabinfo.blkdev_ioc.num_objs
>>     110.67 ± 39%     +74.4%     193.00 ± 46%    +317.5%     462.00 ±  8%  slabinfo.blkdev_queue.active_objs
>>     189.33 ± 23%     +43.7%     272.00 ± 33%    +151.4%     476.00 ± 10%  slabinfo.blkdev_queue.num_objs
>>       1129 ± 10%      -1.9%       1107 ±  7%     +20.8%       1364 ±  6%  slabinfo.blkdev_requests.active_objs
>>       1129 ± 10%      -1.9%       1107 ±  7%     +20.8%       1364 ±  6%  slabinfo.blkdev_requests.num_objs
>>       1058 ±  3%     -10.3%     949.00 ±  9%    -100.0%       0.00 ± -1%  slabinfo.file_lock_ctx.active_objs
>>       1058 ±  3%     -10.3%     949.00 ±  9%    -100.0%       0.00 ± -1%  slabinfo.file_lock_ctx.num_objs
>>       4060 ±  1%      -2.1%       3973 ±  1%     -10.5%       3632 ±  1%  slabinfo.files_cache.active_objs
>>       4060 ±  1%      -2.1%       3973 ±  1%     -10.5%       3632 ±  1%  slabinfo.files_cache.num_objs
>>      10001 ±  0%      -0.3%       9973 ±  0%     -61.1%       3888 ±  0%  slabinfo.ftrace_event_field.active_objs
>>      10001 ±  0%      -0.3%       9973 ±  0%     -61.1%       3888 ±  0%  slabinfo.ftrace_event_field.num_objs
>>       1832 ±  0%      +0.4%       1840 ±  0%    -100.0%       0.00 ± -1%  slabinfo.ftrace_event_file.active_objs
>>       1832 ±  0%      +0.4%       1840 ±  0%    -100.0%       0.00 ± -1%  slabinfo.ftrace_event_file.num_objs
>>       1491 ±  5%      -2.3%       1456 ±  6%     +12.0%       1669 ±  4%  slabinfo.mnt_cache.active_objs
>>       1491 ±  5%      -2.3%       1456 ±  6%     +12.0%       1669 ±  4%  slabinfo.mnt_cache.num_objs
>>     126.33 ± 19%     +10.2%     139.17 ±  9%    -100.0%       0.00 ± -1%  slabinfo.nfs_commit_data.active_objs
>>     126.33 ± 19%     +10.2%     139.17 ±  9%    -100.0%       0.00 ± -1%  slabinfo.nfs_commit_data.num_objs
>>      97.17 ± 20%      -9.1%      88.33 ± 28%    -100.0%       0.00 ± -1%  slabinfo.user_namespace.active_objs
>>      97.17 ± 20%      -9.1%      88.33 ± 28%    -100.0%       0.00 ± -1%  slabinfo.user_namespace.num_objs
>> 
>> Best Regards,
>> Huang, Ying
