Matt,

So you are up at 1GB/sec, which is only 1/4 the non-degraded speed, but
1/2 the expected speed based on the drive data transfers required.  This
is actually pretty good.  I should have mentioned the stripe cache
parameter before, but I use raid "differently" and stripe cache does not
impact my use case.  Sorry.

The 1GB/sec saturating a core is probably as good as it gets.  That core
is doing a lot of stripe cache page manipulations, which are not all that
fast.  Also, the single-parity recovery case should be plain XOR and not
the raid-6 logic, so it should be pretty cheap.  Another point, not
important for this issue, is that the dmesg benchmarks measure parity
generation, not recovery.  Recovery with raid-6 (i.e., two drives failed)
is more expensive than the writes.  I am not sure how optimized this is,
but it could be really bad.

If you need this to go faster, then it is either a raid re-design, or
perhaps you should consider cutting your array into two parts.  Two
12-drive raid-6 arrays will give you more bandwidth, both because the
failures are less "wide" (a single reconstructed read only needs 11 drive
reads instead of 22), and because you get the benefit of two raid-6
threads should you have dead drives on both halves.  You can raid-0 the
arrays together (a rough mdadm sketch is inline further down).  Then
again, you lose two drives' worth of space.

Doug

On Tue, Aug 23, 2016 at 12:26 PM, Matt Garman <matthew.garman@xxxxxxxxx> wrote:
> Doug & Doug,
>
> Thank you for your helpful replies.  I merged both of your posts into
> one; see inline comments below:
>
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the
>> read size.  But since the OP's test case was to "read random files" and
>> not "read random blocks of random files", I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption
>> might have been wrong, but I wrote my explanation with that in mind.
>
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
>
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> processes at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.
>
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads
>> will be to the bad drive" that you have a 24-disk raid6 array?  If so,
>> then that's 22 data disks and 2 parity disks per stripe.  I'm gonna use
>> that as the basis for my next statement even if it's slightly wrong.
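Coming back to my two-array suggestion above: here is a rough sketch of
what the mdadm layout could look like.  This is untested, and the device
names, md numbers, and chunk size are placeholders, so treat it as an
illustration of the shape rather than a recipe:

    # two 12-drive raid-6 arrays, then a raid-0 stripe across them ("raid-60")
    mdadm --create /dev/md1 --level=6 --raid-devices=12 /dev/sd[b-m]
    mdadm --create /dev/md2 --level=6 --raid-devices=12 /dev/sd[n-y]
    mdadm --create /dev/md10 --level=0 --chunk=512 --raid-devices=2 /dev/md1 /dev/md2

With that shape, a degraded read only fans out across the 12 drives of
the affected half, and each half gets its own raid6 kernel thread.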
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
>       sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18]
>       sdt[19] sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5]
>       sdh[7] sdg[6]
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
>       bitmap: 0/15 pages [0KB], 65536KB chunk
>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles the IO ops needed to get
>> your read data, it doesn't address the XOR load to get your data.  With
>> 21 data disks and 1 parity disk, and say a 64k chunk size, you have to
>> XOR 22 64k data blocks for 1 result.  If you are getting 200MB/s, you
>> are actually achieving more like 390MB/s of data read, with 190MB/s of
>> it being direct reads, and then you are using XOR on 200MB/s in order
>> to generate the other 10MB/s of results.
>
> Most of this morning I've been setting/unsetting/changing various
> tunables to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single-threaded XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also,
>> even though the XOR routines try to time their assembly 'just so' so
>> that they can use the cache-avoiding instructions, this fails more
>> often than not, so you end up blowing CPU caches while doing this work,
>> which of course affects the overall system.
>
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
>
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
>
> (The CPU is an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> I'm assuming that however the kernel does its testing, it is fairly
> optimal, and probably assumes ideal cache behavior... so maybe actual
> XOR performance won't be as good as what dmesg suggests... but still,
> 200 MB/s (or even 1000 MB/s, as I'm now getting) is much lower than
> 8000 MB/s...
>
> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example), and if I can force that
> md0_raid6 thread to run on CPU 2, at least the L1/L2 caches should be
> minimally affected...
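On the pinning question: you can usually set the affinity of a kernel
thread from user space with taskset, just like a regular process (a few
kernel threads refuse affinity changes, in which case taskset simply
reports an error).  A rough sketch, assuming the md0_raid6 thread name
from your top output and the isolcpus=2 example you mentioned:

    # look up the raid thread's pid and bind it to CPU 2
    taskset -cp 2 "$(pgrep -x md0_raid6)"

Whether that buys you much depends on how badly the stripe cache traffic
is blowing the caches in the first place.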
>> Possible fixes for this might include:
>> c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk, it
>> helps reduce that IO ops issue)
>
> I suppose this might explain why increasing the array's
> stripe_cache_size gave me such a boost?
>
>> d) Rearchitecting your arrays into raid50 instead of one big raid6 array
>
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.
>
>> (or conversely, have the random head seeks just gone so radically
>> through the roof that the problem here really is the time it takes the
>> heads to travel everywhere we are sending them?)
>
> I'm certain head movement time isn't the issue, as these are SSDs.  :)
>
> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@xxxxxxxxxx> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up?  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20-second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
>
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four processes are always the top
> four.  (Note that before I increased the stripe_cache_size, as
> mentioned above, the md0_raid6 process was only consuming around 50%
> CPU.)
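A side note on the stripe_cache_size tuning mentioned above: the value is
a count of stripe cache entries per array, and each entry holds one page
per member device, so on a 4 KiB page system the memory cost is roughly
stripe_cache_size * 4 KiB * number_of_drives.  A quick back-of-the-envelope
check for a 24-drive array at 16384 (plain shell arithmetic, approximate):

    # entries * 4 KiB per member device * 24 devices, reported in MiB
    echo "$(( 16384 * 4 * 24 / 1024 )) MiB"    # prints: 1536 MiB

So that setting pins roughly 1.5 GB of RAM for the stripe cache, which is
fine if the box has memory to spare, but worth keeping in mind before
raising it further.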
> Here is a representative view of a non-first iteration of "iostat -mxt 5":
>
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
>
> Device:    rrqm/s   wrqm/s      r/s      w/s    rMB/s    wMB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
> sdy          0.00     0.40     0.80     0.60     0.05     0.00    83.43     0.00     1.00     0.50     1.67     1.00     0.14
> sdz          0.00     0.40     0.00     0.60     0.00     0.00    10.67     0.00     2.00     0.00     2.00     2.00     0.12
> sdd      12927.00     0.00   204.40     0.00    51.00     0.00   511.00     5.93    28.75    28.75     0.00     4.31    88.10
> sde      13002.60     0.00   205.20     0.00    51.20     0.00   511.00     6.29    30.39    30.39     0.00     4.59    94.12
> sdf      12976.80     0.00   205.00     0.00    51.00     0.00   509.50     6.17    29.76    29.76     0.00     4.57    93.78
> sdg      12950.20     0.00   205.60     0.00    50.80     0.00   506.03     6.20    29.75    29.75     0.00     4.57    93.88
> sdh      12949.00     0.00   207.20     0.00    50.90     0.00   503.11     6.36    30.35    30.35     0.00     4.59    95.10
> sdb      12196.40     0.00   192.60     0.00    48.10     0.00   511.47     5.48    28.15    28.15     0.00     4.38    84.36
> sda          0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> sdi      12923.00     0.00   208.40     0.00    51.00     0.00   501.20     6.79    32.31    32.31     0.00     4.65    96.84
> sdj      12796.20     0.00   206.80     0.00    50.50     0.00   500.12     6.62    31.73    31.73     0.00     4.62    95.64
> sdk      12746.60     0.00   204.00     0.00    50.20     0.00   503.97     6.38    30.77    30.77     0.00     4.60    93.86
> sdl      12570.00     0.00   202.20     0.00    49.70     0.00   503.39     6.39    31.19    31.19     0.00     4.63    93.68
> sdn      12594.00     0.00   204.20     0.00    49.95     0.00   500.97     6.40    30.99    30.99     0.00     4.58    93.54
> sdm      12569.00     0.00   203.80     0.00    49.90     0.00   501.45     6.30    30.58    30.58     0.00     4.45    90.60
> sdp      12568.80     0.00   205.20     0.00    50.10     0.00   500.03     6.37    30.79    30.79     0.00     4.52    92.72
> sdo      12569.20     0.00   204.00     0.00    49.95     0.00   501.46     6.40    31.07    31.07     0.00     4.58    93.42
> sdw      12568.60     0.00   206.20     0.00    50.00     0.00   496.60     6.34    30.71    30.71     0.00     4.24    87.48
> sdx      12038.60     0.00   197.40     0.00    47.60     0.00   493.84     6.01    30.21    30.21     0.00     4.40    86.86
> sdq      12570.20     0.00   204.20     0.00    50.15     0.00   502.97     6.23    30.41    30.41     0.00     4.44    90.68
> sdr      12571.00     0.00   204.60     0.00    50.25     0.00   502.99     6.15    30.26    30.26     0.00     4.18    85.62
> sds      12495.20     0.00   203.80     0.00    49.95     0.00   501.95     6.00    29.62    29.62     0.00     4.24    86.38
> sdu      12695.60     0.00   207.80     0.00    50.65     0.00   499.17     6.22    30.00    30.00     0.00     4.16    86.38
> sdv      12619.00     0.00   207.80     0.00    50.35     0.00   496.22     6.23    30.03    30.03     0.00     4.20    87.32
> sdt      12671.20     0.00   206.20     0.00    50.50     0.00   501.56     6.05    29.30    29.30     0.00     4.24    87.44
> sdc      12851.60     0.00   203.00     0.00    50.70     0.00   511.50     5.84    28.49    28.49     0.00     4.17    84.64
> md126        0.00     0.00     0.60     1.00     0.05     0.00    71.00     0.00     0.00     0.00     0.00     0.00     0.00
> dm-0         0.00     0.00     0.60     0.80     0.05     0.00    81.14     0.00     2.29     0.67     3.50     1.14     0.16
> dm-1         0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00     0.00
> md0          0.00     0.00  4475.20     0.00  1110.95     0.00   508.41     0.00     0.00     0.00     0.00     0.00     0.00
>
> sdy and sdz are the system drives, so they are uninteresting.
>
> sda is the md0 member drive that I failed; that's why it stays at zero.
>
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
>
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command    Shared Object      Symbol
>   52.85%  swapper    [kernel.kallsyms]  [k] cpu_startup_entry
>    4.47%  md0_raid6  [kernel.kallsyms]  [k] memcpy
>    3.39%  dd         [kernel.kallsyms]  [k] __find_stripe
>    2.50%  md0_raid6  [kernel.kallsyms]  [k] analyse_stripe
>    2.43%  dd         [kernel.kallsyms]  [k] _raw_spin_lock_irq
>    1.75%  rngd       rngd               [.] 0x000000000000288b
>    1.74%  md0_raid6  [kernel.kallsyms]  [k] xor_avx_5
>    1.49%  dd         [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
>    1.33%  md0_raid6  [kernel.kallsyms]  [k] ops_run_io
>    0.65%  dd         [kernel.kallsyms]  [k] raid5_compute_sector
>    0.60%  md0_raid6  [kernel.kallsyms]  [k] _raw_spin_lock_irq
>    0.55%  ps         libc-2.17.so       [.] _IO_vfscanf
>    0.53%  ps         [kernel.kallsyms]  [k] vsnprintf
>    0.51%  ps         [kernel.kallsyms]  [k] format_decode
>    0.47%  ps         [kernel.kallsyms]  [k] number.isra.2
>    0.41%  md0_raid6  [kernel.kallsyms]  [k] raid_run_ops
>    0.40%  md0_raid6  [kernel.kallsyms]  [k] __blk_segment_map_sg
>
> That's my first time using the perf tool, so I need a little
> hand-holding here.
>
> Thanks again all!
> Matt

--
Doug Dumitru
EasyCo LLC