Hi Jeff

Thanks for testing.

It would be interesting ... what happens if you take the patch 3, leave
"struct percpu_rw_semaphore bd_block_size_semaphore" in "struct
block_device", but remove any use of the semaphore from fs/block_dev.c? -
will the performance be like unpatched kernel or like patch 3? It could be
that the change in the alignment affects performance on your CPU too, just
differently than on my CPU.

What is the CPU model that you used for testing?

Mikulas


On Mon, 17 Sep 2012, Jeff Moyer wrote:

> Jeff Moyer <jmoyer@xxxxxxxxxx> writes:
> > Mikulas Patocka <mpatocka@xxxxxxxxxx> writes:
> >> I would be interested if other people did performance testing of the
> >> patches too.
> >
> > I'll do some testing next week, but don't expect to get to it before
> > Wednesday.
>
> Sorry for taking so long on this. I managed to get access to an 80cpu
> (160 threads) system with 1TB of memory. I installed a pcie ssd into
> this machine and did some testing against the raw block device.
>
> I've attached the fio job file I used. Basically, I tested sequential
> reads, sequential writes, random reads, random writes, and then a mix of
> sequential reads and writes, and a mix of random reads and writes. All
> tests used direct I/O to the block device, and each number shown is an
> average of 5 runs. I had to pin the fio processes to the same numa node
> as the pcie adapter in order to get low run-to-run variations. Because
> of the numa factor, I was unable to get reliable results running
> processes against all of the 160 threads on the system. The runs below
> have 4 processes, each pushing a queue depth of 1024.
>
> So, on to the results. I haven't fully investigated them yet, but I
> plan to as they are rather surprising.
>
> The first patch in the series simply adds a semaphore to the
> block_device structure. Mikulas, you had mentioned that this managed to
> have a large effect on your test load.
> In my case, this didn't seem to
> make any difference at all:
>
> 3.6.0-rc5+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   748522  187130  44864   16.34   60.65  3799440.00
> read1        690615  172653  48602        0       0      0   13.45   61.42  4044720.00
> randwrite1        0       0      0   716406  179101  46839   29.03   52.79  3151140.00
> randread1    683466  170866  49108        0       0      0   25.92   54.67  3081610.00
> readwrite1   377518   94379  44450   377645   94410  44450   15.49   64.32  3139240.00
> randrw1      355815   88953  47178   355733   88933  47178   27.96   54.24  2944570.00
> 3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   764037  191009  43925   17.14   60.15  3536950.00
> read1        696880  174220  48152        0       0      0   13.90   61.74  3710168.00
> randwrite1        0       0      0   737331  184332  45511   29.82   52.71  2869440.00
> randread1    689319  172329  48684        0       0      0   26.38   54.58  2927411.00
> readwrite1   387651   96912  43294   387799   96949  43294   16.06   64.92  2814340.00
> randrw1      360298   90074  46591   360304   90075  46591   28.53   54.10  2793120.00
> %diff
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        0       0      0    0.00    0.00       -6.91
> read1             0       0      0        0       0      0    0.00    0.00       -8.27
> randwrite1        0       0      0        0       0      0    0.00    0.00       -8.94
> randread1         0       0      0        0       0      0    0.00    0.00       -5.00
> readwrite1        0       0      0        0       0      0    0.00    0.00      -10.35
> randrw1           0       0      0        0       0      0    0.00    0.00       -5.14
>
>
> The headings are:
> BW = bandwidth in KB/s
> IOPS = I/Os per second
> msec = number of milliseconds the run took (smaller is better)
> usr = %user time
> sys = %system time
> csw = context switches
>
> The first two tables show the results of each run. In this case, the
> first is the unpatched kernel, and the second is the one with the
> block_device structure change. The third table is the % difference
> between the two. A positive number indicates the second run had a
> larger average than the first. I found that the context switch rate was
> rather unpredictable, so I really should have just left that out of the
> reporting.
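
For anyone who wants to re-derive the %diff columns, the arithmetic looks
roughly like the sketch below. The name pct_diff and the 5% cutoff are
assumptions, not taken from Jeff's scripts: the cutoff is only inferred from
the tables, where changes smaller than about 5% are listed as 0 (write1
bandwidth goes from 748522 to 764037, about 2%, and shows up as 0). The
tables also print the BW, IOPS and msec differences as whole numbers, which
the sketch does not try to imitate.

```python
def pct_diff(first, second, threshold=5.0):
    """Percent change from `first` to `second`.

    Positive means the second run had the larger average. Changes whose
    magnitude is below `threshold` percent are reported as 0.0; the 5%
    default is an assumption inferred from the tables in this thread.
    """
    if first == 0:
        # Columns that do not apply to a job (e.g. READ for write1) are 0.
        return 0.0
    diff = (second - first) / first * 100.0
    return diff if abs(diff) >= threshold else 0.0


# csw for write1, unpatched kernel vs. patch 1
print(round(pct_diff(3799440.00, 3536950.00), 2))  # -6.91

# write bandwidth for write1, unpatched kernel vs. patch 1 (about 2%)
print(pct_diff(748522, 764037))  # 0.0
```

With the csw averages for write1 this gives -6.91, which matches the table.
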
>
> As you can see, adding a member to struct block_device did not really
> change the results.
>
>
> Next up is the patch that actually uses the rw semaphore to protect
> access to the block size. Here are the results:
>
> 3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   764037  191009  43925   17.14   60.15  3536950.00
> read1        696880  174220  48152        0       0      0   13.90   61.74  3710168.00
> randwrite1        0       0      0   737331  184332  45511   29.82   52.71  2869440.00
> randread1    689319  172329  48684        0       0      0   26.38   54.58  2927411.00
> readwrite1   387651   96912  43294   387799   96949  43294   16.06   64.92  2814340.00
> randrw1      360298   90074  46591   360304   90075  46591   28.53   54.10  2793120.00
> 3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   816713  204178  41108   16.60   62.06  3159574.00
> read1        749437  187359  44800        0       0      0   13.91   63.69  3190050.00
> randwrite1        0       0      0   747534  186883  44941   29.96   53.23  2617590.00
> randread1    734627  183656  45699        0       0      0   27.02   56.27  2403191.00
> readwrite1   396113   99027  42397   396120   99029  42397   14.50   63.21  3460140.00
> randrw1      374408   93601  44806   374556   93638  44806   28.46   54.33  2688985.00
> %diff
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        6       6     -6    0.00    0.00      -10.67
> read1             7       7     -6        0       0      0    0.00    0.00      -14.02
> randwrite1        0       0      0        0       0      0    0.00    0.00       -8.78
> randread1         6       6     -6        0       0      0    0.00    0.00      -17.91
> readwrite1        0       0      0        0       0      0   -9.71    0.00       22.95
> randrw1           0       0      0        0       0      0    0.00    0.00        0.00
>
> As you can see, there were modest gains in write, read, and randread.
> This is somewhat unexpected, as you would think that introducing locking
> would not *help* performance!
> Investigating the standard deviations for
> each set of 5 runs shows that the performance difference is significant
> (the standard deviation is reported as a percentage of the average):
>
> This is a table of standard deviations for the 5 runs comprising the
> above average with this kernel: 3.6.0-rc5.mikulas.1+
>
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        1       1      1    2.99    1.27        9.10
> read1             0       0      0        0       0      0    2.12    0.53        5.03
> randwrite1        0       0      0        0       0      0    1.25    0.49        5.52
> randread1         1       1      1        0       0      0    1.81    1.18       10.04
> readwrite1        2       2      2        2       2      2   11.35    1.86       26.83
> randrw1           2       2      2        2       2      2    4.01    2.71       22.72
>
> And here are the standard deviations for the .2+ kernel:
>
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        2       2      2    3.33    2.95        7.88
> read1             2       2      2        0       0      0    6.44    2.30       19.27
> randwrite1        0       0      0        3       3      3    0.18    0.52        1.71
> randread1         2       2      2        0       0      0    3.72    2.34       23.70
> readwrite1        3       3      3        3       3      3    3.35    2.61        7.38
> randrw1           1       1      1        1       1      1    1.80    1.00        9.73
>
>
> Next, we'll move on to the third patch in the series, which converts the
> rw semaphore to a per-cpu semaphore.
>
> 3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   816713  204178  41108   16.60   62.06  3159574.00
> read1        749437  187359  44800        0       0      0   13.91   63.69  3190050.00
> randwrite1        0       0      0   747534  186883  44941   29.96   53.23  2617590.00
> randread1    734627  183656  45699        0       0      0   27.02   56.27  2403191.00
> readwrite1   396113   99027  42397   396120   99029  42397   14.50   63.21  3460140.00
> randrw1      374408   93601  44806   374556   93638  44806   28.46   54.33  2688985.00
> 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   870892  217723  38528   17.83   41.57  1697870.00
> read1       1430164  357541  23462        0       0      0   14.41   56.00   241315.00
> randwrite1        0       0      0   759789  189947  44163   31.48   36.36  1256040.00
> randread1   1043830  260958  32146        0       0      0   31.89   44.39   185032.00
> readwrite1   692567  173141  24226   692489  173122  24226   18.65   53.64   311255.00
> randrw1      501208  125302  33469   501446  125361  33469   35.40   41.61   246391.00
> %diff
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        6       6     -6    7.41  -33.02      -46.26
> read1            90      90    -47        0       0      0    0.00  -12.07      -92.44
> randwrite1        0       0      0        0       0      0    5.07  -31.69      -52.02
> randread1        42      42    -29        0       0      0   18.02  -21.11      -92.30
> readwrite1       74      74    -42       74      74    -42   28.62  -15.14      -91.00
> randrw1          33      33    -25       33      33    -25   24.39  -23.41      -90.84
>
> Wow! Switching to the per-cpu semaphore implementation just boosted the
> performance of the I/O path big-time. Note that the system time also
> goes down! So, we get better throughput and less system time. This
> sounds too good to be true.
> ;-) Here are the standard deviations
> (again, shown as percentages) for the .3+ kernel:
>
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        0       0      0    0.96    0.19        1.03
> read1             0       0      0        0       0      0    1.82    0.24        2.46
> randwrite1        0       0      0        0       0      0    0.40    0.39        0.68
> randread1         0       0      0        0       0      0    0.53    0.31        2.02
> readwrite1        0       0      0        0       0      0    2.73    4.07       33.27
> randrw1           1       1      1        1       1      1    0.40    0.10        3.29
>
> Again, there's no slop there, so the results are very reproducible.
>
> Finally, the last patch changes to an rcu-based rw semaphore
> implementation. Here are the results for that, as compared with the
> previous kernel:
>
> 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   870892  217723  38528   17.83   41.57  1697870.00
> read1       1430164  357541  23462        0       0      0   14.41   56.00   241315.00
> randwrite1        0       0      0   759789  189947  44163   31.48   36.36  1256040.00
> randread1   1043830  260958  32146        0       0      0   31.89   44.39   185032.00
> readwrite1   692567  173141  24226   692489  173122  24226   18.65   53.64   311255.00
> randrw1      501208  125302  33469   501446  125361  33469   35.40   41.61   246391.00
> 3.6.0-rc5.mikulas.4+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   812659  203164  41309   16.80   61.71  3208620.00
> read1        739061  184765  45442        0       0      0   14.32   62.85  3375484.00
> randwrite1        0       0      0   726971  181742  46192   30.00   52.33  2736270.00
> randread1    719040  179760  46683        0       0      0   26.47   54.78  2914080.00
> readwrite1   396670   99167  42309   396619   99154  42309   14.91   63.12  3412220.00
> randrw1      374790   93697  44766   374807   93701  44766   28.42   54.10  2774690.00
> %diff
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0       -6      -6      7   -5.78   48.45       88.98
> read1           -48     -48     93        0       0      0    0.00   12.23     1298.79
> randwrite1        0       0      0        0       0      0    0.00   43.92      117.85
> randread1       -31     -31     45        0       0      0  -17.00   23.41     1474.91
> readwrite1      -42     -42     74      -42     -42     74  -20.05   17.67      996.28
> randrw1         -25     -25     33      -25     -25     33  -19.72   30.02     1026.13
>
> And we've lost a good bit
> of performance! Talk about
> counter-intuitive. Here are the standard deviation numbers:
>
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0        2       2      2    2.96    3.00        6.79
> read1             3       3      3        0       0      0    6.52    2.82       21.86
> randwrite1        0       0      0        2       2      2    0.71    0.55        4.07
> randread1         1       1      1        0       0      0    4.13    2.31       20.12
> readwrite1        1       1      1        1       1      1    4.14    2.64        6.12
> randrw1           0       0      0        0       0      0    0.59    0.25        2.99
>
>
> Here is a comparison of the vanilla kernel versus the best performing
> patch in this series (patch 3 of 4):
>
> 3.6.0-rc5+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   748522  187130  44864   16.34   60.65  3799440.00
> read1        690615  172653  48602        0       0      0   13.45   61.42  4044720.00
> randwrite1        0       0      0   716406  179101  46839   29.03   52.79  3151140.00
> randread1    683466  170866  49108        0       0      0   25.92   54.67  3081610.00
> readwrite1   377518   94379  44450   377645   94410  44450   15.49   64.32  3139240.00
> randrw1      355815   88953  47178   355733   88933  47178   27.96   54.24  2944570.00
> 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0   870892  217723  38528   17.83   41.57  1697870.00
> read1       1430164  357541  23462        0       0      0   14.41   56.00   241315.00
> randwrite1        0       0      0   759789  189947  44163   31.48   36.36  1256040.00
> randread1   1043830  260958  32146        0       0      0   31.89   44.39   185032.00
> readwrite1   692567  173141  24226   692489  173122  24226   18.65   53.64   311255.00
> randrw1      501208  125302  33469   501446  125361  33469   35.40   41.61   246391.00
> %diff
>                      READ                   WRITE                      CPU
> Job Name         BW    IOPS   msec       BW    IOPS   msec     usr     sys         csw
> write1            0       0      0       16      16    -14    9.12  -31.46      -55.31
> read1           107     107    -51        0       0      0    7.14   -8.82      -94.03
> randwrite1        0       0      0        6       6     -5    8.44  -31.12      -60.14
> randread1        52      52    -34        0       0      0   23.03  -18.80      -94.00
> readwrite1       83      83    -45       83      83    -45   20.40  -16.60      -90.09
> randrw1          40      40    -29       40      40    -29   26.61  -23.29      -91.63
>
>
> Next up, I'm going to get some perf and blktrace data from these runs to
> see if I can identify why there is such a drastic change in
> performance. I will also attempt to run the tests against a different
> vendor's adapter, and maybe against some FC storage if I can set that up.
>
> Cheers,
> Jeff