Re: kernel checksumming performance vs actual raid device performance

On 8/23/2016 3:26 PM, Matt Garman wrote:
> Doug & Doug,
> 
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
> 
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OP's test case was to "read random files" and not
>> "read random blocks of random files" I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
> 
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
> 
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.

OK, 50 sequential I/Os at a time.  Good point to know.
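
So the load is essentially something like this (my own rough
reconstruction; the file location and dd block size are guesses, not the
OP's actual settings):

for i in $(seq 50); do
    while true; do
        f=$(ls /data/testfiles | shuf -n1)   # pick a pre-generated file at random
        dd if=/data/testfiles/"$f" of=/dev/null bs=1M
    done &
done
wait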

> 
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
> 
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
> 
> Personalities : [raid1] [raid6] [raid5] [raid4]
> 
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13]
> sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19]
> sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
> 
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2
> [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
> 
>       bitmap: 0/15 pages [0KB], 65536KB chunk

Your raid device has a good chunk size for your usage pattern.  If you
had a smallish chunk size (like 64k or 32k), I would actually expect
things to behave differently.  But, then again, maybe I'm wrong and that
would help: with a smaller chunk size, you would be able to fit more
stripes in the stripe cache using less memory.
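
(For what it's worth, the chunk size is fixed when the array is created,
so trying a smaller one would mean rebuilding the array with something
like "mdadm --create ... --chunk=64"; the 64 is just an example value in
KB, not a recommendation.)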

> 
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
> 
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.

Makes sense.  I know the stripe cache size is conservative by default
because it's not shared with the page cache, so you might as well
consider its memory lost.  When you upped it to 32k, with 22 data disks
at a 512k chunk, that's 11MB per stripe and 32768 allowed stripes, which
is a maximum memory consumption of around 350GB of RAM.  I doubt you
have that much in your machine, so I'm guessing it's simply using all
available RAM that the page cache or something else isn't already using.
That also explains why setting it higher doesn't provide any additional
benefits ;-).
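
For the archives, that knob can be read and changed at runtime without
touching the array, e.g.:

cat /sys/block/md0/md/stripe_cache_size       # default is 256
echo 16384 > /sys/block/md0/md/stripe_cache_size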

>  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.

You probably have maxed out your single CPU performance and won't see
any benefit without having a multi-threaded XOR routine.
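
One other experiment that may be worth a try, if your kernel exposes it,
is the raid5/raid6 worker thread count, which spreads stripe handling
across several kernel threads.  Whether it helps the reconstruction path
in your case is something you would have to measure:

cat /sys/block/md0/md/group_thread_cnt        # 0 = single thread (default)
echo 4 > /sys/block/md0/md/group_thread_cnt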

> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
> 
>> 1) 200MB/s of XOR is not insignificant.  Due to our single thread XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache avoiding instructions, this fails more often than
>> not so you end up blowing CPU caches while doing this work, which of
>> course affects the overall system.
> 
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
> 
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
> 
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
> 
> I'm assuming that however the kernel does its testing, it's fairly optimal,

It is *highly* optimal, and what's more, it uses 100% CPU while that
benchmark runs.  The raid6 thread doing your recovery is responsible for
lots of other work too: issuing reads, doing xor, fulfilling write
requests, maintaining the stripe cache, etc.  So start with that 8GB/s
figure, but immediately start subtracting from it because the CPU needs
to do those other things as well.  Then remember that we are under
*extreme* memory pressure.  When you have to bring in 22 reads in order
to reconstruct just 1 block of the same size, then for 100MB/s of
reconstructed reads you are generating 2200MB/s of PCI DMA -> MEM
bandwidth consumption, followed by 2200MB/s of MEM -> register load
bandwidth consumption.  I'd have to read the avx xor routine to know how
much write bandwidth it uses, but it's at least 100MB/s, and likely four
or five times that, because it probably doesn't do all 22 blocks in a
single xor pass: it likely loads the parity, xors in up to maybe four
blocks, and stores the parity again, so each pass re-reads and re-stores
the parity block.  The point of all of this is that people forget to do
the math on the memory bandwidth used by these XOR operations.  The
faster they are, the higher the percentage of main memory bandwidth you
consume.  Subtract all of that from the CPU's total main memory
bandwidth, and what's left over is all you have for doing other
productive work.  Even if you aren't blowing your caches doing all of
this XOR work, you are blowing your main memory bandwidth, and other
threads end up stalling while they wait on main memory accesses to
complete.
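
Summing just the traffic named above: 2200MB/s in over PCI, another
2200MB/s of loads for the XOR, and at least 100MB/s of stores comes to
roughly 4.5GB/s of memory bandwidth, at a minimum, to deliver 100MB/s of
reconstructed data, before counting the multi-pass parity re-reads and
re-stores.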

> and probably assumes ideal cache behavior... so maybe actual XOR
> performance won't be as good as what dmesg suggests...

It will never be that good, and you can thank your stars that it isn't,
because if it were, your computer would grind to a halt with nothing
happening but data XOR computations.

> but still, 200
> MB/s (or even 1000 MB/s, as I'm now getting), is much lower than 8000
> MB/s...

The math fits.  Most quad-channel Intel CPUs have a theoretical maximum
memory bandwidth in the 50GByte/s range, but it's not bidirectional, and
it's not even multi-access, so you have to remember that the usage looks
like this on a good (non-degraded) read:

copy 1: DMA from PCI bus to main memory
copy 2: Load from main memory to CPU for copy_to_user
copy 3: Store from CPU to main memory for user

To get 8GB/s of read performance undegraded would therefore require
24GB/s of actual memory bandwidth just for the copies.  That's half of
your entire memory bandwidth (unless you have multiple sockets, in which
case things get more complex, but this is still true for one socket of a
multi-socket machine).  Once you add the XOR routine into the picture,
the three accesses are the same for part of it, but for degraded fixups
it is much worse.

> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...

You could try that, but I doubt it will affect much.
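
If you do want to experiment, you don't strictly need isolcpus just to
move the thread; something along these lines should work for most md
kernel threads (the kernel can refuse affinity changes on some kthreads,
so treat it as try-and-see):

taskset -pc 2 $(pgrep -x md0_raid6)    # pin the md0_raid6 thread to CPU 2

Booting with isolcpus=2 on top of that would keep ordinary tasks off
that CPU so the thread has it mostly to itself.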

>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
> 
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?

Yes.  The default setting is conservative; you told it to use as much
memory as it needs.

>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
> 
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.

That's a huge waste; are you sure he didn't use raid0 for the stripe?

>  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.

I would try it, if you are OK with only tolerating a single disk failure
per raid5 set anyway.

>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
> 
> I'm certain head movement time isn't the issue, as these are SSDs.  :)

Fair enough ;-).  And given these are SSDs, I'd be just fine doing
something like four 6-disk raid5s, then striped in a raid0, myself.  The
main cause for concern with spinning disks is latent bad sectors causing
a read error on rebuild; with SSDs that's much less of a concern.
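
As an untested sketch of that layout (device names here are
placeholders, not your actual drive assignments):

mdadm --create /dev/md1 --level=5 --raid-devices=6 /dev/sd[a-f]
mdadm --create /dev/md2 --level=5 --raid-devices=6 /dev/sd[g-l]
mdadm --create /dev/md3 --level=5 --raid-devices=6 /dev/sd[m-r]
mdadm --create /dev/md4 --level=5 --raid-devices=6 /dev/sd[s-x]
mdadm --create /dev/md10 --level=0 --raid-devices=4 /dev/md[1-4]

A degraded read then only has to reconstruct from the 5 surviving
members of one leg instead of 21 data disks plus parity.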

> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@xxxxxxxxxx> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
> 
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
> 
> 
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)

I would try to tune your stripe cache size such that the kswapd?
processes go to sleep.  Those are reading/writing swap.  That won't help
your overall performance.
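
A quick way to confirm whether they are actually pushing pages out to
swap is to watch the si/so columns while the test runs:

vmstat 5     # non-zero si/so means real swap traffic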

> Here is a representative view of a non-first iteration of "iostat -mxt 5":
> 
> 
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s
> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00
> 83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00
> 10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00
> 511.00     5.93   28.75   28.75    0.00   4.31  88.10

I'm not sure how much I trust some of these numbers.  According to this,
you are issuing 200 read/s, at an average size of 511KB, which should
work out to roughly 100MB/s of data read, but rMB/s is only 51.  I
wonder if the read requests from the raid6 thread are bypassing the
rMB/s accounting because they aren't coming from the VFS or some such?
It would explain why the rMB/s is only half of what it should be based
upon requests and average request size.

> sde           13002.60     0.00  205.20    0.00    51.20     0.00
> 511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00
> 509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00
> 506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00
> 503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00
> 511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00
> 501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00
> 500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00
> 503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00
> 503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00
> 500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00
> 501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00
> 500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00
> 501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00
> 496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00
> 493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00
> 502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00
> 502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00
> 501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00
> 499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00
> 496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00
> 501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00
> 511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00
> 71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00
> 81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00
> 508.41     0.00    0.00    0.00    0.00   0.00   0.00
> 
> 
> sdy and sdz are the system drives, so they are uninteresting.
> 
> sda is the md0 drive I failed, that's why it stays at zero.
> 
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
> 
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k]
> copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
> 
> 
> That's my first time using the perf tool, so I need a little hand-holding here.

You might get more interesting perf results if you could pin the md
raid6 thread to a single CPU and then filter the perf results to just
that CPU.
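
For example, with the thread pinned to CPU 2, something like:

perf record -C 2 sleep 20
perf report

should limit the samples to just that CPU.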


-- 
Doug Ledford <dledford@xxxxxxxxxx>
    GPG Key ID: 0E572FDD
