> > - bio size can be increased and it should improve some high-bandwidth IO
> >   case in theory[4].
>

Hi,

I would like to report that your patch set works well on my system, which
is based on v4.14.48 and provides a KVM-based virtualization service.

I thought the multipage bvec could improve the performance of my system,
so I back-ported your patches to v4.14.48. The back-port went through
without any serious problem. To prevent conflicts, I only needed to
cherry-pick two patches first:
"blk-merge: compute bio->bi_seg_front_size efficiently" and
"block: move bio_alloc_pages() to bcache".

I then ran my own test suite to check the features of the md and RAID1
layers. There was no problem; all test cases passed.
(If you want, I will send you the back-ported patches.)

After that I ran the two performance tests described below.
To state the conclusion first: I failed to show any performance
improvement from the patch set. Of course, my test cases may not be
suitable for testing your patch set, or maybe I ran the tests incorrectly.
Please tell me which tools are suitable, and I will try them.

1. fio

First I ran fio against a null_blk device to check the performance of the
block layer. I am not sure this test is suitable for showing a performance
improvement or degradation. Nevertheless, there was a small performance
degradation (about -6% to -7% in bandwidth and IOPS).

If it is not too much trouble, please review my fio options and tell me
if any of them are wrong. Then I will run the test again.

1.1 Following are my options for fio.

gkim@ib1:~/pb-ltp/benchmark/fio$ cat go_local.sh
#!/bin/bash
echo "fio start : $(date)"
echo "kernel info : $(uname -a)"
echo "fio version : $(fio --version)"

# set "none" io-scheduler
modprobe -r null_blk
modprobe null_blk
echo "none" > /sys/block/nullb0/queue/scheduler

FIO_OPTION="--direct=1 --rw=randrw:2 --time_based=1 --group_reporting \
            --ioengine=libaio --iodepth=64 --name=fiotest --numjobs=8 \
            --bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4 \
            --fadvise_hint=0 --iodepth_batch_submit=64 --iodepth_batch_complete=64"
# fio test null_blk device, so it is not necessary to run long.
fio $FIO_OPTION --filename=/dev/nullb0 --runtime=600

1.2 Following is the result before porting.

fio start : Mon Jun 11 04:30:01 CEST 2018
kernel info : Linux ib1 4.14.48-1-pserver #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP x86_64 GNU/Linux
fio version : fio-2.2.10
fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
...
fio-2.2.10
Starting 8 processes

fiotest: (groupid=0, jobs=8): err= 0: pid=1655: Mon Jun 11 04:40:02 2018
  read : io=7133.2GB, bw=12174MB/s, iops=1342.1K, runt=600001msec
    slat (usec): min=1, max=15750, avg=123.78, stdev=153.79
    clat (usec): min=0, max=15758, avg=24.70, stdev=77.93
     lat (usec): min=2, max=15782, avg=148.49, stdev=167.54
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    6],
     | 70.00th=[   22], 80.00th=[   36], 90.00th=[   72], 95.00th=[  107],
     | 99.00th=[  173], 99.50th=[  203], 99.90th=[  932], 99.95th=[ 1416],
     | 99.99th=[ 2960]
    bw (MB /s): min= 1096, max= 2147, per=12.51%, avg=1522.69, stdev=253.89
  write: io=7131.3GB, bw=12171MB/s, iops=1343.6K, runt=600001msec
    slat (usec): min=1, max=15751, avg=124.73, stdev=154.11
    clat (usec): min=0, max=15758, avg=24.69, stdev=77.84
     lat (usec): min=2, max=15780, avg=149.43, stdev=167.82
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    6],
     | 70.00th=[   22], 80.00th=[   36], 90.00th=[   72], 95.00th=[  107],
     | 99.00th=[  173], 99.50th=[  203], 99.90th=[  932], 99.95th=[ 1416],
     | 99.99th=[ 2960]
    bw (MB /s): min= 1080, max= 2121, per=12.51%, avg=1522.33, stdev=253.96
    lat (usec) : 2=21.63%, 4=37.80%, 10=2.12%, 20=6.43%, 50=16.70%
    lat (usec) : 100=8.86%, 250=6.07%, 500=0.17%, 750=0.08%, 1000=0.05%
    lat (msec) : 2=0.06%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=22.39%, sys=64.19%, ctx=15425825, majf=0, minf=97
  IO depths    : 1=1.8%, 2=1.8%, 4=8.8%, 8=14.4%, 16=12.3%, 32=41.7%, >=64=19.3%
     submit    : 0=0.0%, 4=5.8%, 8=9.7%, 16=15.0%, 32=18.0%, 64=51.5%, >=64=0.0%
     complete  : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
     issued    : total=r=805764385/w=806127393/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=7133.2GB, aggrb=12174MB/s, minb=12174MB/s, maxb=12174MB/s, mint=600001msec, maxt=600001msec
  WRITE: io=7131.3GB, aggrb=12171MB/s, minb=12171MB/s, maxb=12171MB/s, mint=600001msec, maxt=600001msec

Disk stats (read/write):
  nullb0: ios=442461761/442546060, merge=363197836/363473703, ticks=12280990/12452480, in_queue=2740, util=0.43%

1.3 Following is the result after porting.

fio start : Fri Jun 15 12:42:47 CEST 2018
kernel info : Linux ib1 4.14.48-1-pserver-mpbvec+ #12 SMP Fri Jun 15 12:21:36 CEST 2018 x86_64 GNU/Linux
fio version : fio-2.2.10
fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K, ioengine=libaio, iodepth=64
...
fio-2.2.10
Starting 8 processes
Jobs: 4 (f=0): [m(1),_(2),m(1),_(1),m(2),_(1)] [100.0% done] [8430MB/8444MB/0KB /s] [961K/963K/0 iops] [eta 00m:00s]
fiotest: (groupid=0, jobs=8): err= 0: pid=14096: Fri Jun 15 12:52:48 2018
  read : io=6633.8GB, bw=11322MB/s, iops=1246.9K, runt=600005msec
    slat (usec): min=1, max=16939, avg=135.34, stdev=156.23
    clat (usec): min=0, max=16947, avg=26.10, stdev=78.50
     lat (usec): min=2, max=16957, avg=161.45, stdev=168.88
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    5],
     | 70.00th=[   23], 80.00th=[   37], 90.00th=[   79], 95.00th=[  115],
     | 99.00th=[  181], 99.50th=[  211], 99.90th=[  948], 99.95th=[ 1416],
     | 99.99th=[ 2864]
    bw (MB /s): min= 1106, max= 2031, per=12.51%, avg=1416.05, stdev=201.81
  write: io=6631.1GB, bw=11318MB/s, iops=1247.5K, runt=600005msec
    slat (usec): min=1, max=16938, avg=136.48, stdev=156.54
    clat (usec): min=0, max=16947, avg=26.08, stdev=78.43
     lat (usec): min=2, max=16957, avg=162.58, stdev=169.15
    clat percentiles (usec):
     |  1.00th=[    0],  5.00th=[    1], 10.00th=[    1], 20.00th=[    1],
     | 30.00th=[    2], 40.00th=[    2], 50.00th=[    2], 60.00th=[    5],
     | 70.00th=[   23], 80.00th=[   37], 90.00th=[   79], 95.00th=[  115],
     | 99.00th=[  181], 99.50th=[  211], 99.90th=[  948], 99.95th=[ 1416],
     | 99.99th=[ 2864]
    bw (MB /s): min= 1084, max= 2044, per=12.51%, avg=1415.67, stdev=201.93
    lat (usec) : 2=20.98%, 4=38.82%, 10=2.15%, 20=5.08%, 50=16.91%
    lat (usec) : 100=8.75%, 250=6.91%, 500=0.19%, 750=0.09%, 1000=0.05%
    lat (msec) : 2=0.07%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=21.02%, sys=65.53%, ctx=15321661, majf=0, minf=78
  IO depths    : 1=1.9%, 2=1.9%, 4=9.5%, 8=13.6%, 16=11.2%, 32=42.1%, >=64=19.9%
     submit    : 0=0.0%, 4=6.3%, 8=10.1%, 16=14.1%, 32=18.2%, 64=51.3%, >=64=0.0%
     complete  : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
     issued    : total=r=748120019/w=748454509/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=6633.8GB, aggrb=11322MB/s, minb=11322MB/s, maxb=11322MB/s, mint=600005msec, maxt=600005msec
  WRITE: io=6631.1GB, aggrb=11318MB/s, minb=11318MB/s, maxb=11318MB/s, mint=600005msec, maxt=600005msec

Disk stats (read/write):
  nullb0: ios=410911387/410974086, merge=337127604/337396176, ticks=12482050/12662790, in_queue=1780, util=0.27%

2. Unixbench

Second, I ran Unixbench to check general performance. I see no real
difference before and after porting the patches: the index scores differ
by less than 1% (2547.7 vs. 2560.0 with 1 copy, 7803.5 vs. 7794.3 with
8 copies). Unixbench might not be suitable for checking a performance
improvement in the block layer. If you tell me which tools are suitable,
I will try them on my system.

2.1 Following is the result before porting.

   BYTE UNIX Benchmarks (Version 5.1.3)

   System: ib1: GNU/Linux
   OS: GNU/Linux -- 4.14.48-1-pserver -- #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
   Machine: x86_64 (unknown)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   05:00:01 up 3 days, 16:20, 2 users, load average: 0.00, 0.11, 1.11; runlevel 2018-06-07

------------------------------------------------------------------------
Benchmark Run: Mon Jun 11 2018 05:00:01 - 05:28:54
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       47158867.7 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3878.8 MWIPS (15.2 s, 7 samples)
Execl Throughput                               9203.9 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1490834.8 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          388784.2 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       3744780.2 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2682620.1 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 263786.5 lps   (10.0 s, 7 samples)
Process Creation                              19674.0 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  16121.5 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   5623.5 lpm   (60.0 s, 2 samples)
System Call Overhead                        4068991.3 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT      INDEX
Dhrystone 2 using register variables         116700.0   47158867.7     4041.0
Double-Precision Whetstone                       55.0       3878.8      705.2
Execl Throughput                                 43.0       9203.9     2140.4
File Copy 1024 bufsize 2000 maxblocks          3960.0    1490834.8     3764.7
File Copy 256 bufsize 500 maxblocks            1655.0     388784.2     2349.1
File Copy 4096 bufsize 8000 maxblocks          5800.0    3744780.2     6456.5
Pipe Throughput                               12440.0    2682620.1     2156.4
Pipe-based Context Switching                   4000.0     263786.5      659.5
Process Creation                                126.0      19674.0     1561.4
Shell Scripts (1 concurrent)                     42.4      16121.5     3802.2
Shell Scripts (8 concurrent)                      6.0       5623.5     9372.5
System Call Overhead                          15000.0    4068991.3     2712.7
                                                                     ========
System Benchmarks Index Score                                          2547.7

------------------------------------------------------------------------
Benchmark Run: Mon Jun 11 2018 05:28:54 - 05:57:07
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables      234727639.9 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    35350.9 MWIPS (10.7 s, 7 samples)
Execl Throughput                              43811.3 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1401373.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          366033.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       4360829.6 KBps  (30.0 s, 2 samples)
Pipe Throughput                            12875165.6 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                2431725.6 lps   (10.0 s, 7 samples)
Process Creation                              97360.8 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  58879.6 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   9232.5 lpm   (60.0 s, 2 samples)
System Call Overhead                        9497958.7 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT      INDEX
Dhrystone 2 using register variables         116700.0  234727639.9    20113.8
Double-Precision Whetstone                       55.0      35350.9     6427.4
Execl Throughput                                 43.0      43811.3    10188.7
File Copy 1024 bufsize 2000 maxblocks          3960.0    1401373.1     3538.8
File Copy 256 bufsize 500 maxblocks            1655.0     366033.9     2211.7
File Copy 4096 bufsize 8000 maxblocks          5800.0    4360829.6     7518.7
Pipe Throughput                               12440.0   12875165.6    10349.8
Pipe-based Context Switching                   4000.0    2431725.6     6079.3
Process Creation                                126.0      97360.8     7727.0
Shell Scripts (1 concurrent)                     42.4      58879.6    13886.7
Shell Scripts (8 concurrent)                      6.0       9232.5    15387.5
System Call Overhead                          15000.0    9497958.7     6332.0
                                                                     ========
System Benchmarks Index Score                                          7803.5

2.2 Following is the result after porting.

   BYTE UNIX Benchmarks (Version 5.1.3)

   System: ib1: GNU/Linux
   OS: GNU/Linux -- 4.14.48-1-pserver-mpbvec+ -- #12 SMP Fri Jun 15 12:21:36 CEST 2018
   Machine: x86_64 (unknown)
   Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
   CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
          Hyper-Threading, x86-64, MMX, Physical Address Ext, SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
   13:16:11 up 50 min, 1 user, load average: 0.00, 1.40, 3.46; runlevel 2018-06-15

------------------------------------------------------------------------
Benchmark Run: Fri Jun 15 2018 13:16:11 - 13:45:04
8 CPUs in system; running 1 parallel copy of tests

Dhrystone 2 using register variables       47103754.6 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                     3886.3 MWIPS (15.1 s, 7 samples)
Execl Throughput                               8965.0 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1510285.9 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          395196.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       3802788.0 KBps  (30.0 s, 2 samples)
Pipe Throughput                             2670169.1 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                 275093.8 lps   (10.0 s, 7 samples)
Process Creation                              19707.1 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  16046.8 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   5600.8 lpm   (60.0 s, 2 samples)
System Call Overhead                        4104142.0 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT      INDEX
Dhrystone 2 using register variables         116700.0   47103754.6     4036.3
Double-Precision Whetstone                       55.0       3886.3      706.6
Execl Throughput                                 43.0       8965.0     2084.9
File Copy 1024 bufsize 2000 maxblocks          3960.0    1510285.9     3813.9
File Copy 256 bufsize 500 maxblocks            1655.0     395196.9     2387.9
File Copy 4096 bufsize 8000 maxblocks          5800.0    3802788.0     6556.5
Pipe Throughput                               12440.0    2670169.1     2146.4
Pipe-based Context Switching                   4000.0     275093.8      687.7
Process Creation                                126.0      19707.1     1564.1
Shell Scripts (1 concurrent)                     42.4      16046.8     3784.6
Shell Scripts (8 concurrent)                      6.0       5600.8     9334.6
System Call Overhead                          15000.0    4104142.0     2736.1
                                                                     ========
System Benchmarks Index Score                                          2560.0

------------------------------------------------------------------------
Benchmark Run: Fri Jun 15 2018 13:45:04 - 14:13:17
8 CPUs in system; running 8 parallel copies of tests

Dhrystone 2 using register variables      237271982.6 lps   (10.0 s, 7 samples)
Double-Precision Whetstone                    35186.8 MWIPS (10.7 s, 7 samples)
Execl Throughput                              42557.8 lps   (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks       1403922.0 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks          367436.5 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks       4380468.3 KBps  (30.0 s, 2 samples)
Pipe Throughput                            12872664.6 lps   (10.0 s, 7 samples)
Pipe-based Context Switching                2451404.5 lps   (10.0 s, 7 samples)
Process Creation                              97788.2 lps   (30.0 s, 2 samples)
Shell Scripts (1 concurrent)                  58505.9 lpm   (60.0 s, 2 samples)
Shell Scripts (8 concurrent)                   9195.4 lpm   (60.0 s, 2 samples)
System Call Overhead                        9467372.2 lps   (10.0 s, 7 samples)

System Benchmarks Index Values               BASELINE       RESULT      INDEX
Dhrystone 2 using register variables         116700.0  237271982.6    20331.8
Double-Precision Whetstone                       55.0      35186.8     6397.6
Execl Throughput                                 43.0      42557.8     9897.2
File Copy 1024 bufsize 2000 maxblocks          3960.0    1403922.0     3545.3
File Copy 256 bufsize 500 maxblocks            1655.0     367436.5     2220.2
File Copy 4096 bufsize 8000 maxblocks          5800.0    4380468.3     7552.5
Pipe Throughput                               12440.0   12872664.6    10347.8
Pipe-based Context Switching                   4000.0    2451404.5     6128.5
Process Creation                                126.0      97788.2     7761.0
Shell Scripts (1 concurrent)                     42.4      58505.9    13798.6
Shell Scripts (8 concurrent)                      6.0       9195.4    15325.6
System Call Overhead                          15000.0    9467372.2     6311.6
                                                                     ========
System Benchmarks Index Score                                          7794.3

--
GIOH KIM
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel: +49 176 2697 8962
Fax: +49 30 577 008 299
Email: gi-oh.kim@xxxxxxxxxxxxxxxx
URL: https://www.profitbricks.de

Registered office: Berlin
Register court: Amtsgericht Charlottenburg, HRB 125506 B
Managing directors: Achim Weiss, Matthias Steinberg, Christoph Steffens
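
For reference, the relative change between the two fio runs (sections 1.2
and 1.3) can be recomputed from the reported summary figures with a small
shell helper. This is only a sketch: the function name pct_change is made
up for illustration and is not part of go_local.sh, and the input numbers
are copied by hand from the fio summaries in this mail.

```shell
#!/bin/bash
# pct_change BEFORE AFTER
# Print the relative change from BEFORE to AFTER in percent, with one
# decimal place. Hypothetical helper for illustration only.
pct_change() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# Aggregate bandwidth in MB/s, taken from the "Run status" lines.
echo "read  bw   : $(pct_change 12174 11322) %"
echo "write bw   : $(pct_change 12171 11318) %"

# IOPS in thousands, taken from the per-direction summary lines.
echo "read  iops : $(pct_change 1342.1 1246.9) %"
echo "write iops : $(pct_change 1343.6 1247.5) %"
```

With the figures above, all four values come out at roughly -7%. The same
helper can be applied to the Unixbench index scores.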