On Fri, Jun 15, 2018 at 02:59:19PM +0200, Gi-Oh Kim wrote:
> >
> > - bio size can be increased and it should improve some high-bandwidth IO
> > case in theory[4].
> >
> 
> Hi,
> I would like to report that your patch set works well on my system, based on v4.14.48.
> I thought the multipage bvec could improve the performance of my system.
> (FYI, my system runs v4.14.48 and provides a KVM-based virtualization service.)

Thanks for your test!

> 
> So I back-ported your patches to v4.14.48.
> The back-port went without any serious problems.
> I only needed to cherry-pick the "blk-merge: compute
> bio->bi_seg_front_size efficiently" and
> "block: move bio_alloc_pages() to bcache" patches before back-porting
> to prevent conflicts.

Not sure I understand your point; you have to backport all of the patches.

> And I ran my own test suite to check the features of the md and RAID1 layers.
> There was no problem; all test cases passed.
> (If you want, I will send you the back-ported patches.)
> 
> Then I ran two performance tests, as follows.
> To state the conclusion first, I could not show any performance improvement
> from the patch set.
> Of course, my test cases may not be suitable for testing your patch set,
> or maybe I ran the tests incorrectly.
> Please inform me which tools are suitable, and then I will try them.
> 
> 1. fio
> 
> First I ran fio against a null_blk device to check the performance of the block layer.
> I am not sure this test is suitable for showing the performance
> improvement or degradation.
> Nevertheless, there was a small (-6%) performance degradation.
> 
> If it is not too much trouble, please review my options for fio and
> inform me if I used any wrong or incorrect options.
> Then I will run the test again.
> 
> 1.1 Following are my options for fio.
> 
> gkim@ib1:~/pb-ltp/benchmark/fio$ cat go_local.sh
> #!/bin/bash
> echo "fio start : $(date)"
> echo "kernel info : $(uname -a)"
> echo "fio version : $(fio --version)"
> 
> # set "none" io-scheduler
> modprobe -r null_blk
> modprobe null_blk
> echo "none" > /sys/block/nullb0/queue/scheduler
> 
> FIO_OPTION="--direct=1 --rw=randrw:2 --time_based=1 --group_reporting \
> --ioengine=libaio --iodepth=64 --name=fiotest --numjobs=8 \
> --bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4 \
> --fadvise_hint=0 --iodepth_batch_submit=64 --iodepth_batch_complete=64"
> # fio tests a null_blk device, so it is not necessary to run long.
> fio $FIO_OPTION --filename=/dev/nullb0 --runtime=600
> 
> 1.2 Following is the result before porting.
> 
> fio start : Mon Jun 11 04:30:01 CEST 2018
> kernel info : Linux ib1 4.14.48-1-pserver
> #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
> x86_64 GNU/Linux
> fio version : fio-2.2.10
> fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
> ...
> fio-2.2.10 > Starting 8 processes > > fiotest: (groupid=0, jobs=8): err= 0: pid=1655: Mon Jun 11 04:40:02 2018 > read : io=7133.2GB, bw=12174MB/s, iops=1342.1K, runt=600001msec > slat (usec): min=1, max=15750, avg=123.78, stdev=153.79 > clat (usec): min=0, max=15758, avg=24.70, stdev=77.93 > lat (usec): min=2, max=15782, avg=148.49, stdev=167.54 > clat percentiles (usec): > | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1], > | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6], > | 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107], > | 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416], > | 99.99th=[ 2960] > bw (MB /s): min= 1096, max= 2147, per=12.51%, avg=1522.69, stdev=253.89 > write: io=7131.3GB, bw=12171MB/s, iops=1343.6K, runt=600001msec > slat (usec): min=1, max=15751, avg=124.73, stdev=154.11 > clat (usec): min=0, max=15758, avg=24.69, stdev=77.84 > lat (usec): min=2, max=15780, avg=149.43, stdev=167.82 > clat percentiles (usec): > | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1], > | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6], > | 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107], > | 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416], > | 99.99th=[ 2960] > bw (MB /s): min= 1080, max= 2121, per=12.51%, avg=1522.33, stdev=253.96 > lat (usec) : 2=21.63%, 4=37.80%, 10=2.12%, 20=6.43%, 50=16.70% > lat (usec) : 100=8.86%, 250=6.07%, 500=0.17%, 750=0.08%, 1000=0.05% > lat (msec) : 2=0.06%, 4=0.02%, 10=0.01%, 20=0.01% > cpu : usr=22.39%, sys=64.19%, ctx=15425825, majf=0, minf=97 > IO depths : 1=1.8%, 2=1.8%, 4=8.8%, 8=14.4%, 16=12.3%, 32=41.7%, >=64=19.3% > submit : 0=0.0%, 4=5.8%, 8=9.7%, 16=15.0%, 32=18.0%, 64=51.5%, >=64=0.0% > complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0% > issued : total=r=805764385/w=806127393/d=0, short=r=0/w=0/d=0, > drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=64 > > Run status group 0 (all jobs): > READ: io=7133.2GB, aggrb=12174MB/s, minb=12174MB/s, maxb=12174MB/s, > mint=600001msec, maxt=600001msec > WRITE: io=7131.3GB, aggrb=12171MB/s, minb=12171MB/s, maxb=12171MB/s, > mint=600001msec, maxt=600001msec > > Disk stats (read/write): > nullb0: ios=442461761/442546060, merge=363197836/363473703, > ticks=12280990/12452480, in_queue=2740, util=0.43% > > 1.3 Following is the result after porting. > > fio start : Fri Jun 15 12:42:47 CEST 2018 > kernel info : Linux ib1 4.14.48-1-pserver-mpbvec+ #12 SMP Fri Jun 15 > 12:21:36 CEST 2018 x86_64 GNU/Linux > fio version : fio-2.2.10 > fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, > ioengine=libaio, iodepth=64 > ... 
> fio-2.2.10 > Starting 8 processes > Jobs: 4 (f=0): [m(1),_(2),m(1),_(1),m(2),_(1)] [100.0% done] > [8430MB/8444MB/0KB /s] [961K/963K/0 iops] [eta 00m:00s] > fiotest: (groupid=0, jobs=8): err= 0: pid=14096: Fri Jun 15 12:52:48 2018 > read : io=6633.8GB, bw=11322MB/s, iops=1246.9K, runt=600005msec > slat (usec): min=1, max=16939, avg=135.34, stdev=156.23 > clat (usec): min=0, max=16947, avg=26.10, stdev=78.50 > lat (usec): min=2, max=16957, avg=161.45, stdev=168.88 > clat percentiles (usec): > | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1], > | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5], > | 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115], > | 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416], > | 99.99th=[ 2864] > bw (MB /s): min= 1106, max= 2031, per=12.51%, avg=1416.05, stdev=201.81 > write: io=6631.1GB, bw=11318MB/s, iops=1247.5K, runt=600005msec > slat (usec): min=1, max=16938, avg=136.48, stdev=156.54 > clat (usec): min=0, max=16947, avg=26.08, stdev=78.43 > lat (usec): min=2, max=16957, avg=162.58, stdev=169.15 > clat percentiles (usec): > | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1], > | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5], > | 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115], > | 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416], > | 99.99th=[ 2864] > bw (MB /s): min= 1084, max= 2044, per=12.51%, avg=1415.67, stdev=201.93 > lat (usec) : 2=20.98%, 4=38.82%, 10=2.15%, 20=5.08%, 50=16.91% > lat (usec) : 100=8.75%, 250=6.91%, 500=0.19%, 750=0.09%, 1000=0.05% > lat (msec) : 2=0.07%, 4=0.02%, 10=0.01%, 20=0.01% > cpu : usr=21.02%, sys=65.53%, ctx=15321661, majf=0, minf=78 > IO depths : 1=1.9%, 2=1.9%, 4=9.5%, 8=13.6%, 16=11.2%, 32=42.1%, >=64=19.9% > submit : 0=0.0%, 4=6.3%, 8=10.1%, 16=14.1%, 32=18.2%, > 64=51.3%, >=64=0.0% > complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0% > issued : total=r=748120019/w=748454509/d=0, short=r=0/w=0/d=0, > drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=64 > > Run status group 0 (all jobs): > READ: io=6633.8GB, aggrb=11322MB/s, minb=11322MB/s, maxb=11322MB/s, > mint=600005msec, maxt=600005msec > WRITE: io=6631.1GB, aggrb=11318MB/s, minb=11318MB/s, maxb=11318MB/s, > mint=600005msec, maxt=600005msec > > Disk stats (read/write): > nullb0: ios=410911387/410974086, merge=337127604/337396176, > ticks=12482050/12662790, in_queue=1780, util=0.27% > > > 2. Unixbench > > Second I rand Unixbench to check general performance. > I think there is no difference before and after porting the patches. > Unixbench might not be suitable to check the performance improvement > of the block layer. > If you inform me which tools is suitable, I will try it on my system. > > 2.1 Following is the result before porting. 
> > BYTE UNIX Benchmarks (Version 5.1.3) > > System: ib1: GNU/Linux > OS: GNU/Linux -- 4.14.48-1-pserver -- > #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP > Machine: x86_64 (unknown) > Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8") > CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > 05:00:01 up 3 days, 16:20, 2 users, load average: 0.00, 0.11, > 1.11; runlevel 2018-06-07 > > ------------------------------------------------------------------------ > Benchmark Run: Mon Jun 11 2018 05:00:01 - 05:28:54 > 8 CPUs in system; running 1 parallel copy of tests > > Dhrystone 2 using register variables 47158867.7 lps (10.0 s, 7 samples) > Double-Precision Whetstone 3878.8 MWIPS (15.2 s, 7 samples) > Execl Throughput 9203.9 lps (30.0 s, 2 samples) > File Copy 1024 bufsize 2000 maxblocks 1490834.8 KBps (30.0 s, 2 samples) > File Copy 256 bufsize 500 maxblocks 388784.2 KBps (30.0 s, 2 samples) > File Copy 4096 bufsize 8000 maxblocks 3744780.2 KBps (30.0 s, 2 samples) > Pipe Throughput 2682620.1 lps (10.0 s, 7 samples) > Pipe-based Context Switching 263786.5 lps (10.0 s, 7 samples) > Process Creation 19674.0 lps (30.0 s, 2 samples) > Shell Scripts (1 concurrent) 16121.5 lpm (60.0 s, 2 samples) > Shell Scripts (8 concurrent) 5623.5 lpm (60.0 s, 2 samples) > System Call Overhead 4068991.3 lps (10.0 s, 7 samples) > > System Benchmarks Index Values BASELINE RESULT INDEX > Dhrystone 2 using register variables 116700.0 47158867.7 4041.0 > Double-Precision Whetstone 55.0 3878.8 705.2 > Execl Throughput 43.0 9203.9 2140.4 > File Copy 1024 bufsize 2000 maxblocks 3960.0 1490834.8 3764.7 > File Copy 256 bufsize 500 maxblocks 1655.0 388784.2 2349.1 > File Copy 4096 bufsize 8000 maxblocks 5800.0 3744780.2 6456.5 > Pipe Throughput 12440.0 2682620.1 2156.4 > Pipe-based Context Switching 4000.0 263786.5 659.5 > Process Creation 126.0 19674.0 1561.4 > Shell Scripts (1 concurrent) 42.4 16121.5 3802.2 > Shell Scripts (8 concurrent) 6.0 5623.5 9372.5 > System Call Overhead 15000.0 4068991.3 2712.7 > ======== > System Benchmarks Index Score 2547.7 > > ------------------------------------------------------------------------ > Benchmark 
Run: Mon Jun 11 2018 05:28:54 - 05:57:07 > 8 CPUs in system; running 8 parallel copies of tests > > Dhrystone 2 using register variables 234727639.9 lps (10.0 s, 7 samples) > Double-Precision Whetstone 35350.9 MWIPS (10.7 s, 7 samples) > Execl Throughput 43811.3 lps (30.0 s, 2 samples) > File Copy 1024 bufsize 2000 maxblocks 1401373.1 KBps (30.0 s, 2 samples) > File Copy 256 bufsize 500 maxblocks 366033.9 KBps (30.0 s, 2 samples) > File Copy 4096 bufsize 8000 maxblocks 4360829.6 KBps (30.0 s, 2 samples) > Pipe Throughput 12875165.6 lps (10.0 s, 7 samples) > Pipe-based Context Switching 2431725.6 lps (10.0 s, 7 samples) > Process Creation 97360.8 lps (30.0 s, 2 samples) > Shell Scripts (1 concurrent) 58879.6 lpm (60.0 s, 2 samples) > Shell Scripts (8 concurrent) 9232.5 lpm (60.0 s, 2 samples) > System Call Overhead 9497958.7 lps (10.0 s, 7 samples) > > System Benchmarks Index Values BASELINE RESULT INDEX > Dhrystone 2 using register variables 116700.0 234727639.9 20113.8 > Double-Precision Whetstone 55.0 35350.9 6427.4 > Execl Throughput 43.0 43811.3 10188.7 > File Copy 1024 bufsize 2000 maxblocks 3960.0 1401373.1 3538.8 > File Copy 256 bufsize 500 maxblocks 1655.0 366033.9 2211.7 > File Copy 4096 bufsize 8000 maxblocks 5800.0 4360829.6 7518.7 > Pipe Throughput 12440.0 12875165.6 10349.8 > Pipe-based Context Switching 4000.0 2431725.6 6079.3 > Process Creation 126.0 97360.8 7727.0 > Shell Scripts (1 concurrent) 42.4 58879.6 13886.7 > Shell Scripts (8 concurrent) 6.0 9232.5 15387.5 > System Call Overhead 15000.0 9497958.7 6332.0 > ======== > System Benchmarks Index Score 7803.5 > > > 2.2 Following is the result after porting. > > BYTE UNIX Benchmarks (Version 5.1.3) > > System: ib1: GNU/Linux > OS: GNU/Linux -- 4.14.48-1-pserver-mpbvec+ -- #12 SMP Fri Jun 15 > 12:21:36 CEST 2018 > Machine: x86_64 (unknown) > Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8") > CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips) > Hyper-Threading, x86-64, MMX, Physical Address Ext, > SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization > 13:16:11 up 50 min, 1 user, load average: 0.00, 1.40, 3.46; > runlevel 2018-06-15 > > ------------------------------------------------------------------------ > Benchmark Run: Fri Jun 15 2018 13:16:11 - 13:45:04 > 8 
CPUs in system; running 1 parallel copy of tests
> 
> Dhrystone 2 using register variables        47103754.6 lps   (10.0 s, 7 samples)
> Double-Precision Whetstone                      3886.3 MWIPS  (15.1 s, 7 samples)
> Execl Throughput                                8965.0 lps    (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks        1510285.9 KBps  (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks           395196.9 KBps  (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks        3802788.0 KBps  (30.0 s, 2 samples)
> Pipe Throughput                              2670169.1 lps   (10.0 s, 7 samples)
> Pipe-based Context Switching                  275093.8 lps   (10.0 s, 7 samples)
> Process Creation                               19707.1 lps   (30.0 s, 2 samples)
> Shell Scripts (1 concurrent)                   16046.8 lpm   (60.0 s, 2 samples)
> Shell Scripts (8 concurrent)                    5600.8 lpm   (60.0 s, 2 samples)
> System Call Overhead                         4104142.0 lps   (10.0 s, 7 samples)
> 
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0   47103754.6   4036.3
> Double-Precision Whetstone                       55.0       3886.3    706.6
> Execl Throughput                                 43.0       8965.0   2084.9
> File Copy 1024 bufsize 2000 maxblocks          3960.0    1510285.9   3813.9
> File Copy 256 bufsize 500 maxblocks            1655.0     395196.9   2387.9
> File Copy 4096 bufsize 8000 maxblocks          5800.0    3802788.0   6556.5
> Pipe Throughput                               12440.0    2670169.1   2146.4
> Pipe-based Context Switching                   4000.0     275093.8    687.7
> Process Creation                                126.0      19707.1   1564.1
> Shell Scripts (1 concurrent)                     42.4      16046.8   3784.6
> Shell Scripts (8 concurrent)                      6.0       5600.8   9334.6
> System Call Overhead                          15000.0    4104142.0   2736.1
>                                                                    ========
> System Benchmarks Index Score                                        2560.0
> 
> ------------------------------------------------------------------------
> Benchmark Run: Fri Jun 15 2018 13:45:04 - 14:13:17
> 8 CPUs in system; running 8 parallel copies of tests
> 
> Dhrystone 2 using register variables       237271982.6 lps   (10.0 s, 7 samples)
> Double-Precision Whetstone                     35186.8 MWIPS  (10.7 s, 7 samples)
> Execl Throughput                               42557.8 lps    (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks        1403922.0 KBps  (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks           367436.5 KBps  (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks        4380468.3 KBps  (30.0 s, 2 samples)
> Pipe Throughput                             12872664.6 lps   (10.0 s, 7 samples)
> Pipe-based Context Switching                 2451404.5 lps   (10.0 s, 7 samples)
> Process Creation                               97788.2 lps   (30.0 s, 2 samples)
> Shell Scripts (1 concurrent)                   58505.9 lpm   (60.0 s, 2 samples)
> Shell Scripts (8 concurrent)                    9195.4 lpm   (60.0 s, 2 samples)
> System Call Overhead                         9467372.2 lps   (10.0 s, 7 samples)
> 
> System Benchmarks Index Values               BASELINE       RESULT    INDEX
> Dhrystone 2 using register variables         116700.0  237271982.6  20331.8
> Double-Precision Whetstone                       55.0      35186.8   6397.6
> Execl Throughput                                 43.0      42557.8   9897.2
> File Copy 1024 bufsize 2000 maxblocks          3960.0    1403922.0   3545.3
> File Copy 256 bufsize 500 maxblocks            1655.0     367436.5   2220.2
> File Copy 4096 bufsize 8000 maxblocks          5800.0    4380468.3   7552.5
> Pipe Throughput                               12440.0   12872664.6  10347.8
> Pipe-based Context Switching                   4000.0    2451404.5   6128.5
> Process Creation                                126.0      97788.2   7761.0
> Shell Scripts (1 concurrent)                     42.4      58505.9  13798.6
> Shell Scripts (8 concurrent)                      6.0       9195.4  15325.6
> System Call Overhead                          15000.0    9467372.2   6311.6
>                                                                    ========
> System Benchmarks Index Score                                        7794.3

At least, with this patch set BIO_MAX_PAGES can now stay fixed at 256 in the
CONFIG_THP_SWAP case; otherwise two pages may have to be allocated just to
hold the bvec table, so tests involving THP_SWAP may be improved. Also,
filesystems may support IO to/from THP, and multipage bvecs should improve
that case too.
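If you want to give fio another try, a job that issues large sequential direct
IO is more likely to show the effect of bigger bios than the small mixed block
sizes in your go_local.sh. The script below is only an untested sketch along
those lines; the device, block size, queue depth and runtime are example
values, so please adjust them for your machine:

#!/bin/bash
# Untested example: large sequential reads, so each bio can be filled with
# many contiguous pages; compare throughput before and after the series.
modprobe -r null_blk
modprobe null_blk
echo "none" > /sys/block/nullb0/queue/scheduler

FIO_OPTION="--direct=1 --rw=read --bs=1M --time_based=1 --group_reporting \
--ioengine=libaio --iodepth=32 --name=bigbio --numjobs=4"
fio $FIO_OPTION --filename=/dev/nullb0 --runtime=120

Running the same job against a real disk or one of your md/RAID1 devices may
also be more telling than null_blk, since null_blk completes requests without
touching the data pages.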
Long term, there is an opportunity to improve fs code by allocating the bvec
table with 'nr_segment' entries instead of 'nr_page' entries, because
physically contiguous pages are often allocated from mm for the same process.

So this patchset is just a start; at the current stage I am focusing on making
it stable, since storing the multipage segments instead of each page is the
correct approach.

Thanks again for your test.

Thanks,
Ming