Jeff Moyer <jmoyer@xxxxxxxxxx> writes: > Mikulas Patocka <mpatocka@xxxxxxxxxx> writes: >> I would be interested if other people did performance testing of the >> patches too. > > I'll do some testing next week, but don't expect to get to it before > Wednesday. Sorry for taking so long on this. I managed to get access to an 80cpu (160 threads) system with 1TB of memory. I installed a pcie ssd into this machine and did some testing against the raw block device. I've attached the fio job file I used. Basically, I tested sequential reads, sequential writes, random reads, random writes, and then a mix of sequential reads and writes, and a mix of random reads and writes. All tests used direct I/O to the block device, and each number shown is an average of 5 runs. I had to pin the fio processes to the same numa node as the pcie adapter in order to get low run-to-run variations. Because of the numa factor, I was unable to get reliable results running processes against all of the 160 threads on the system. The runs below have 4 processes, each pushing a queue depth of 1024. So, on to the results. I haven't fully investigated them yet, but I plan to as they are rather surprising. The first patch in the series simply adds a semaphore to the block_device structure. Mikulas, you had mentioned that this managed to have a large effect on your test load. In my case, this didn't seem to make any difference at all: 3.6.0-rc5+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 748522 187130 44864 16.34 60.65 3799440.00 read1 690615 172653 48602 0 0 0 13.45 61.42 4044720.00 randwrite1 0 0 0 716406 179101 46839 29.03 52.79 3151140.00 randread1 683466 170866 49108 0 0 0 25.92 54.67 3081610.00 readwrite1 377518 94379 44450 377645 94410 44450 15.49 64.32 3139240.00 randrw1 355815 88953 47178 355733 88933 47178 27.96 54.24 2944570.00 3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 764037 191009 43925 17.14 60.15 3536950.00 read1 696880 174220 48152 0 0 0 13.90 61.74 3710168.00 randwrite1 0 0 0 737331 184332 45511 29.82 52.71 2869440.00 randread1 689319 172329 48684 0 0 0 26.38 54.58 2927411.00 readwrite1 387651 96912 43294 387799 96949 43294 16.06 64.92 2814340.00 randrw1 360298 90074 46591 360304 90075 46591 28.53 54.10 2793120.00 %diff READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 0 0 0 0.00 0.00 -6.91 read1 0 0 0 0 0 0 0.00 0.00 -8.27 randwrite1 0 0 0 0 0 0 0.00 0.00 -8.94 randread1 0 0 0 0 0 0 0.00 0.00 -5.00 readwrite1 0 0 0 0 0 0 0.00 0.00 -10.35 randrw1 0 0 0 0 0 0 0.00 0.00 -5.14 The headings are: BW = bandwidth in KB/s IOPS = I/Os per second msec = number of miliseconds the run took (smaller is better) usr = %user time sys = %system time csw = context switches The first two tables show the results of each run. In this case, the first is the unpatched kernel, and the second is the one with the block_device structure change. The third table is the % difference between the two. A positive number indicates the second run had a larger average than the first. I found that the context switch rate was rather unpredictable, so I really should have just left that out of the reporting. As you can see, adding a member to struct block_device did not really change the results. Next up is the patch that actually uses the rw semaphore to protect access to the block size. Here are the results: 3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 764037 191009 43925 17.14 60.15 3536950.00 read1 696880 174220 48152 0 0 0 13.90 61.74 3710168.00 randwrite1 0 0 0 737331 184332 45511 29.82 52.71 2869440.00 randread1 689319 172329 48684 0 0 0 26.38 54.58 2927411.00 readwrite1 387651 96912 43294 387799 96949 43294 16.06 64.92 2814340.00 randrw1 360298 90074 46591 360304 90075 46591 28.53 54.10 2793120.00 3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 816713 204178 41108 16.60 62.06 3159574.00 read1 749437 187359 44800 0 0 0 13.91 63.69 3190050.00 randwrite1 0 0 0 747534 186883 44941 29.96 53.23 2617590.00 randread1 734627 183656 45699 0 0 0 27.02 56.27 2403191.00 readwrite1 396113 99027 42397 396120 99029 42397 14.50 63.21 3460140.00 randrw1 374408 93601 44806 374556 93638 44806 28.46 54.33 2688985.00 %diff READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 6 6 -6 0.00 0.00 -10.67 read1 7 7 -6 0 0 0 0.00 0.00 -14.02 randwrite1 0 0 0 0 0 0 0.00 0.00 -8.78 randread1 6 6 -6 0 0 0 0.00 0.00 -17.91 readwrite1 0 0 0 0 0 0 -9.71 0.00 22.95 randrw1 0 0 0 0 0 0 0.00 0.00 0.00 As you can see, there were modest gains in write, read, and randread. This is somewhat unexpected, as you would think that introducing locking would not *help* performance! Investigating the standard deviations for each set of 5 runs shows that the performance difference is significant (the standard deviation is reported as a percentage of the average): This is a table of standard deviations for the 5 runs comprising the above average with this kernel: 3.6.0-rc5.mikulas.1+ READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 1 1 1 2.99 1.27 9.10 read1 0 0 0 0 0 0 2.12 0.53 5.03 randwrite1 0 0 0 0 0 0 1.25 0.49 5.52 randread1 1 1 1 0 0 0 1.81 1.18 10.04 readwrite1 2 2 2 2 2 2 11.35 1.86 26.83 randrw1 2 2 2 2 2 2 4.01 2.71 22.72 And here are the standard deviations for the .2+ kernel: READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 2 2 2 3.33 2.95 7.88 read1 2 2 2 0 0 0 6.44 2.30 19.27 randwrite1 0 0 0 3 3 3 0.18 0.52 1.71 randread1 2 2 2 0 0 0 3.72 2.34 23.70 readwrite1 3 3 3 3 3 3 3.35 2.61 7.38 randrw1 1 1 1 1 1 1 1.80 1.00 9.73 Next, we'll move on to the third patch in the series, which converts the rw semaphore to a per-cpu semaphore. 3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 816713 204178 41108 16.60 62.06 3159574.00 read1 749437 187359 44800 0 0 0 13.91 63.69 3190050.00 randwrite1 0 0 0 747534 186883 44941 29.96 53.23 2617590.00 randread1 734627 183656 45699 0 0 0 27.02 56.27 2403191.00 readwrite1 396113 99027 42397 396120 99029 42397 14.50 63.21 3460140.00 randrw1 374408 93601 44806 374556 93638 44806 28.46 54.33 2688985.00 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 870892 217723 38528 17.83 41.57 1697870.00 read1 1430164 357541 23462 0 0 0 14.41 56.00 241315.00 randwrite1 0 0 0 759789 189947 44163 31.48 36.36 1256040.00 randread1 1043830 260958 32146 0 0 0 31.89 44.39 185032.00 readwrite1 692567 173141 24226 692489 173122 24226 18.65 53.64 311255.00 randrw1 501208 125302 33469 501446 125361 33469 35.40 41.61 246391.00 %diff READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 6 6 -6 7.41 -33.02 -46.26 read1 90 90 -47 0 0 0 0.00 -12.07 -92.44 randwrite1 0 0 0 0 0 0 5.07 -31.69 -52.02 randread1 42 42 -29 0 0 0 18.02 -21.11 -92.30 readwrite1 74 74 -42 74 74 -42 28.62 -15.14 -91.00 randrw1 33 33 -25 33 33 -25 24.39 -23.41 -90.84 Wow! Switching to the per-cpu semaphore implementation just boosted the performance of the I/O path big-time. Note that the system time also goes down! So, we get better throughput and less system time. This sounds too good to be true. ;-) Here are the standard deviations (again, shown as percentages) for the .3+ kernel: READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 0 0 0 0.96 0.19 1.03 read1 0 0 0 0 0 0 1.82 0.24 2.46 randwrite1 0 0 0 0 0 0 0.40 0.39 0.68 randread1 0 0 0 0 0 0 0.53 0.31 2.02 readwrite1 0 0 0 0 0 0 2.73 4.07 33.27 randrw1 1 1 1 1 1 1 0.40 0.10 3.29 Again, there's no slop there, so the results are very reproducible. Finally, the last patch changes to an rcu-based rw semaphore implementation. Here are the results for that, as compared with the previous kernel: 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 870892 217723 38528 17.83 41.57 1697870.00 read1 1430164 357541 23462 0 0 0 14.41 56.00 241315.00 randwrite1 0 0 0 759789 189947 44163 31.48 36.36 1256040.00 randread1 1043830 260958 32146 0 0 0 31.89 44.39 185032.00 readwrite1 692567 173141 24226 692489 173122 24226 18.65 53.64 311255.00 randrw1 501208 125302 33469 501446 125361 33469 35.40 41.61 246391.00 3.6.0-rc5.mikulas.4+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 812659 203164 41309 16.80 61.71 3208620.00 read1 739061 184765 45442 0 0 0 14.32 62.85 3375484.00 randwrite1 0 0 0 726971 181742 46192 30.00 52.33 2736270.00 randread1 719040 179760 46683 0 0 0 26.47 54.78 2914080.00 readwrite1 396670 99167 42309 396619 99154 42309 14.91 63.12 3412220.00 randrw1 374790 93697 44766 374807 93701 44766 28.42 54.10 2774690.00 %diff READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 -6 -6 7 -5.78 48.45 88.98 read1 -48 -48 93 0 0 0 0.00 12.23 1298.79 randwrite1 0 0 0 0 0 0 0.00 43.92 117.85 randread1 -31 -31 45 0 0 0 -17.00 23.41 1474.91 readwrite1 -42 -42 74 -42 -42 74 -20.05 17.67 996.28 randrw1 -25 -25 33 -25 -25 33 -19.72 30.02 1026.13 And we've lost a good bit of performance! Talk about counter-intuitive. Here are the standard deviation numbers: READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 2 2 2 2.96 3.00 6.79 read1 3 3 3 0 0 0 6.52 2.82 21.86 randwrite1 0 0 0 2 2 2 0.71 0.55 4.07 randread1 1 1 1 0 0 0 4.13 2.31 20.12 readwrite1 1 1 1 1 1 1 4.14 2.64 6.12 randrw1 0 0 0 0 0 0 0.59 0.25 2.99 Here is a comparison of the vanilla kernel versus the best performing patch in this series (patch 3 of 4): 3.6.0-rc5+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 748522 187130 44864 16.34 60.65 3799440.00 read1 690615 172653 48602 0 0 0 13.45 61.42 4044720.00 randwrite1 0 0 0 716406 179101 46839 29.03 52.79 3151140.00 randread1 683466 170866 49108 0 0 0 25.92 54.67 3081610.00 readwrite1 377518 94379 44450 377645 94410 44450 15.49 64.32 3139240.00 randrw1 355815 88953 47178 355733 88933 47178 27.96 54.24 2944570.00 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 870892 217723 38528 17.83 41.57 1697870.00 read1 1430164 357541 23462 0 0 0 14.41 56.00 241315.00 randwrite1 0 0 0 759789 189947 44163 31.48 36.36 1256040.00 randread1 1043830 260958 32146 0 0 0 31.89 44.39 185032.00 readwrite1 692567 173141 24226 692489 173122 24226 18.65 53.64 311255.00 randrw1 501208 125302 33469 501446 125361 33469 35.40 41.61 246391.00 %diff READ WRITE CPU Job Name BW IOPS msec BW IOPS msec usr sys csw write1 0 0 0 16 16 -14 9.12 -31.46 -55.31 read1 107 107 -51 0 0 0 7.14 -8.82 -94.03 randwrite1 0 0 0 6 6 -5 8.44 -31.12 -60.14 randread1 52 52 -34 0 0 0 23.03 -18.80 -94.00 readwrite1 83 83 -45 83 83 -45 20.40 -16.60 -90.09 randrw1 40 40 -29 40 40 -29 26.61 -23.29 -91.63 Next up, I'm going to get some perf and blktrace data from these runs to see if I can identify why there is such a drastic change in performance. I will also attempt to run the tests against a different vendor's adapter, and maybe against some FC storage if I can set that up. Cheers, Jeff
[global] ioengine=libaio direct=1 iodepth=1024 iodepth_batch=32 iodepth_batch_complete=1 blocksize=4k filename=/dev/XXX size=8g group_reporting=1 readwrite=write [write1] offset=0 [write2] offset=8g [write3] offset=16g [write4] offset=24g [global] readwrite=read [read1] stonewall offset=0 [read2] offset=8g [read3] offset=16g [read4] offset=24g [global] readwrite=randwrite [randwrite1] stonewall offset=0 [randwrite2] offset=8g [randwrite3] offset=16g [randwrite4] offset=24g [global] readwrite=randread [randread1] stonewall offset=0 [randread2] offset=8g [randread3] offset=16g [randread4] offset=24g [global] readwrite=readwrite [readwrite1] stonewall offset=0 [readwrite2] offset=8g [readwrite3] offset=16g [readwrite4] offset=24g [global] readwrite=randrw [randrw1] stonewall offset=0 [randrw2] offset=8g [randrw3] offset=16g [randrw4] offset=24g