Re: kernel checksumming performance vs actual raid device performance

Doug & Doug,

Thank you both for your helpful replies.  I've merged my responses to
both of your posts into one message; see inline comments below:

On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
> Of course.  I didn't mean to imply otherwise.  The read size is the read
> size.  But, since the OPs test case was to "read random files" and not
> "read random blocks of random files" I took it to mean it would be
> sequential IO across a multitude of random files.  That assumption might
> have been wrong, but I wrote my explanation with that in mind.

Yes, multiple parallel sequential reads.  Our test program generates
a bunch of big random files (file sizes follow a roughly normal
distribution centered around 500 MB, ranging from about 100 MB up to
a few multi-GB outliers).  The file generation is a one-time step,
and we don't really care about its performance.

The read-testing program randomly picks one of those files and reads
it start-to-finish using "dd".  It kicks off several "dd" processes
at once (currently 50, though that's a run-time parameter).  This is
how we generate the read load, and I watch iostat while it runs to
see how much read throughput I'm getting from the array.
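
In case it clarifies things, the load generator boils down to roughly
the following (a simplified sketch -- the directory and defaults are
placeholders, and the real tool does a bit more bookkeeping):

    #!/bin/bash
    # Simplified sketch of our read-load generator.
    FILEDIR=/data/testfiles    # pre-generated random files (placeholder path)
    READERS=${1:-50}           # number of parallel dd readers

    reader() {
        while true; do
            # pick a random file and read it start-to-finish
            f=$(find "$FILEDIR" -type f | shuf -n 1)
            dd if="$f" of=/dev/null bs=1M 2>/dev/null
        done
    }

    for i in $(seq "$READERS"); do
        reader &
    done
    wait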


On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
> This depends a lot on how you structured your raid array.  I didn't see
> your earlier emails, so I'm inferring from the "one out of 22 reads will
> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
> as the basis for my next statement even if it's slightly wrong.

Yes, that is exactly correct; here's the relevant part of /proc/mdstat:

Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13] sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19] sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
      44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk


> Doug was right in that you will have to read 21 data disks and 1 parity
> disk to reconstruct reads from the missing block of any given stripe.
> And while he is also correct that this doubles IO ops needed to get your
> read data, it doesn't address the XOR load to get your data.  With 19
> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
> actually achieving more like 390MB/s of data read, with 190MB/s of it
> being direct reads, and then you are using XOR on 200MB/s in order to
> generate the other 10MB/s of results.

I've spent most of this morning setting, unsetting, and changing
various tunables to see if I could increase the read speed.  I got a
huge boost by increasing /sys/block/md0/md/stripe_cache_size from the
default (256, IIRC) to 16384; doubling it again to 32768 didn't seem
to bring any further benefit.  With stripe_cache_size at 16384, I'm
now getting around 1000 MB/s of reads in the degraded state.  When
the degraded array was only doing 200 MB/s, the md0_raid6 process was
using about 50% CPU according to top; now, with a 5x increase in read
speed, md0_raid6 is pegged at 100% CPU.  Reads are still down by
roughly a factor of eight compared to the healthy array, though,
where I'd only expect a factor of two.
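
For reference, the whole change was just the sysfs knob (it doesn't
persist across reboots, so I'll eventually need to set it from a udev
rule or init script):

    # raise the raid5/6 stripe cache from the default 256 to 16384 entries
    echo 16384 > /sys/block/md0/md/stripe_cache_size
    cat /sys/block/md0/md/stripe_cache_size    # confirm the new value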

> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
> routines, you can actually keep a CPU pretty busy with this.  Also, even
> though the XOR routines try to time their assembly 'just so' so that
> they can use the cache avoiding instructions, this fails more often than
> not so you end up blowing CPU caches while doing this work, which of
> course effects the overall system.

While 200 MB/s of XOR sounds high, the kernel is "advertising" over
8000 MB/s, per dmesg:

[    6.386820] xor: automatically using best checksumming function:
[    6.396690]    avx       : 24064.000 MB/sec
[    6.414706] raid6: sse2x1   gen()  7636 MB/s
[    6.431725] raid6: sse2x2   gen()  3656 MB/s
[    6.448742] raid6: sse2x4   gen()  3917 MB/s
[    6.465753] raid6: avx2x1   gen()  5425 MB/s
[    6.482766] raid6: avx2x2   gen()  7593 MB/s
[    6.499773] raid6: avx2x4   gen()  8648 MB/s
[    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
[    6.499774] raid6: using avx2x2 recovery algorithm

(CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)

I'm assuming the kernel's benchmark is fairly optimal and probably
assumes ideal cache behavior, so actual XOR performance may not be as
good as what dmesg suggests.  Even so, 200 MB/s (or even the 1000
MB/s I'm now getting) is much lower than 8000 MB/s.

Is it possible to pin kernel threads to a CPU?  I'm thinking I could
reboot with isolcpus=2 (for example) and then force the md0_raid6
thread onto CPU 2, so that at least its L1/L2 caches would see
minimal interference from other tasks.
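
What I have in mind, assuming md's kernel threads accept affinity
changes at all (I haven't tried yet), is roughly:

    # boot with isolcpus=2 to keep ordinary tasks off CPU 2, then:
    taskset -pc 2 $(pgrep -x md0_raid6)   # move the raid thread to CPU 2
    taskset -p $(pgrep -x md0_raid6)      # read the affinity mask back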

> Possible fixes for this might include:
>         c) Creating a consolidated page cache/stripe cache (if we can read more
> of the blocks needed to get our data from cache instead of disk it helps
> reduce that IO ops issue)

I suppose that might explain why increasing the array's
stripe_cache_size gave me such a boost?
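
One side effect I need to keep an eye on is memory: if I'm reading
Documentation/md.txt right, stripe_cache_size is counted in pages per
device, so the cache costs roughly stripe_cache_size * page_size *
nr_devices:

    $ echo $(( 16384 * 4096 * 24 / 1024 / 1024 )) MiB
    1536 MiB

So about 1.5 GiB of RAM at the 16k setting -- worth remembering.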

>         d) Rearchitecting your arrays into raid50 instead of big raid6 array

My colleague tested the exact same hardware configured as three
hardware raid5 arrays, striped together with software raid0.  Clearly
not apples-to-apples, but he did get dramatically better degraded and
rebuild performance.  I do intend to test a pure software raid50
implementation as well.
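
My rough plan for that test (device names and chunk size below are
just placeholders) is three 8-drive raid5 legs with a raid0 stripe on
top:

    mdadm --create /dev/md1 --level=5 --raid-devices=8 --chunk=512 /dev/sd[a-h]
    mdadm --create /dev/md2 --level=5 --raid-devices=8 --chunk=512 /dev/sd[i-p]
    mdadm --create /dev/md3 --level=5 --raid-devices=8 --chunk=512 /dev/sd[q-x]
    mdadm --create /dev/md4 --level=0 --raid-devices=3 /dev/md1 /dev/md2 /dev/md3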

> (or conversely has the random head seeks just gone so
> radically through the roof that the problem here really is the time it
> takes the heads to travel everywhere we are sending them).

I'm certain head movement time isn't the issue, as these are SSDs.  :)

On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@xxxxxxxxxx> wrote:
> Can you run an iostat during your degraded test, and also a top run
> over 20+ seconds with kernel threads showing up.  Even better would be
> a perf capture, but you might not have all the tools installed.  You
> can always try:
>
> perf record -a sleep 20
>
> then
>
> perf report
>
> should show you the top functions globally over the 20 second sample.
> If you don't have perf loaded, you might (or might not) be able to
> load it from the distro.

Running top for 20 or more seconds, the top processes in terms of CPU
usage are pretty static:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
 1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
  107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
  108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
 6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd


I truncated the output.  The "dd" processes are part of our testing
tool that generates the heavy read load on the array.  Any given "dd"
process might jump around, but those top four (md0_raid6, rngd, and
the two kswapd threads) are always the same.  (Note that before I
increased stripe_cache_size as mentioned above, md0_raid6 was only
consuming around 50% CPU.)

Here is a representative sample of "iostat -mxt 5" (skipping the
first report, which only shows averages since boot):


08/23/2016 01:37:59 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.84    0.00   27.41   67.59    0.00    0.17

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdy               0.00     0.40    0.80    0.60     0.05     0.00    83.43     0.00    1.00    0.50    1.67   1.00   0.14
sdz               0.00     0.40    0.00    0.60     0.00     0.00    10.67     0.00    2.00    0.00    2.00   2.00   0.12
sdd           12927.00     0.00  204.40    0.00    51.00     0.00   511.00     5.93   28.75   28.75    0.00   4.31  88.10
sde           13002.60     0.00  205.20    0.00    51.20     0.00   511.00     6.29   30.39   30.39    0.00   4.59  94.12
sdf           12976.80     0.00  205.00    0.00    51.00     0.00   509.50     6.17   29.76   29.76    0.00   4.57  93.78
sdg           12950.20     0.00  205.60    0.00    50.80     0.00   506.03     6.20   29.75   29.75    0.00   4.57  93.88
sdh           12949.00     0.00  207.20    0.00    50.90     0.00   503.11     6.36   30.35   30.35    0.00   4.59  95.10
sdb           12196.40     0.00  192.60    0.00    48.10     0.00   511.47     5.48   28.15   28.15    0.00   4.38  84.36
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdi           12923.00     0.00  208.40    0.00    51.00     0.00   501.20     6.79   32.31   32.31    0.00   4.65  96.84
sdj           12796.20     0.00  206.80    0.00    50.50     0.00   500.12     6.62   31.73   31.73    0.00   4.62  95.64
sdk           12746.60     0.00  204.00    0.00    50.20     0.00   503.97     6.38   30.77   30.77    0.00   4.60  93.86
sdl           12570.00     0.00  202.20    0.00    49.70     0.00   503.39     6.39   31.19   31.19    0.00   4.63  93.68
sdn           12594.00     0.00  204.20    0.00    49.95     0.00   500.97     6.40   30.99   30.99    0.00   4.58  93.54
sdm           12569.00     0.00  203.80    0.00    49.90     0.00   501.45     6.30   30.58   30.58    0.00   4.45  90.60
sdp           12568.80     0.00  205.20    0.00    50.10     0.00   500.03     6.37   30.79   30.79    0.00   4.52  92.72
sdo           12569.20     0.00  204.00    0.00    49.95     0.00   501.46     6.40   31.07   31.07    0.00   4.58  93.42
sdw           12568.60     0.00  206.20    0.00    50.00     0.00   496.60     6.34   30.71   30.71    0.00   4.24  87.48
sdx           12038.60     0.00  197.40    0.00    47.60     0.00   493.84     6.01   30.21   30.21    0.00   4.40  86.86
sdq           12570.20     0.00  204.20    0.00    50.15     0.00   502.97     6.23   30.41   30.41    0.00   4.44  90.68
sdr           12571.00     0.00  204.60    0.00    50.25     0.00   502.99     6.15   30.26   30.26    0.00   4.18  85.62
sds           12495.20     0.00  203.80    0.00    49.95     0.00   501.95     6.00   29.62   29.62    0.00   4.24  86.38
sdu           12695.60     0.00  207.80    0.00    50.65     0.00   499.17     6.22   30.00   30.00    0.00   4.16  86.38
sdv           12619.00     0.00  207.80    0.00    50.35     0.00   496.22     6.23   30.03   30.03    0.00   4.20  87.32
sdt           12671.20     0.00  206.20    0.00    50.50     0.00   501.56     6.05   29.30   29.30    0.00   4.24  87.44
sdc           12851.60     0.00  203.00    0.00    50.70     0.00   511.50     5.84   28.49   28.49    0.00   4.17  84.64
md126             0.00     0.00    0.60    1.00     0.05     0.00    71.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.60    0.80     0.05     0.00    81.14     0.00    2.29    0.67    3.50   1.14   0.16
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00 4475.20    0.00  1110.95     0.00   508.41     0.00    0.00    0.00    0.00   0.00   0.00


sdy and sdz are the system drives, so they are uninteresting.

sda is the md0 member drive that I failed; that's why it stays at zero.

And lastly, here's the output of the perf commands you suggested (at
least the top part):

Samples: 561K of event 'cycles', Event count (approx.): 318536644203
Overhead  Command         Shared Object                 Symbol
  52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
   4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
   3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
   2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
   2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
   1.75%  rngd            rngd                          [.] 0x000000000000288b
   1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
   1.49%  dd              [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
   1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
   0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
   0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
   0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
   0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
   0.51%  ps              [kernel.kallsyms]             [k] format_decode
   0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
   0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
   0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg


That's my first time using the perf tool, so I need a little hand-holding here.
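
If a more targeted capture would help, my reading of the perf man
page suggests something like this to get call graphs and then limit
the report to the raid thread (correct me if there's a better way):

    perf record -a -g sleep 20       # system-wide sample with call graphs
    perf report --comms md0_raid6    # only show samples from that thread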

Thanks again all!
Matt
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


