Re: kernel checksumming performance vs actual raid device performance

Matt,

So you are up to 1 GB/sec, which is still only 1/4 of the non-degraded
speed, but 1/2 of the expected speed based on the drive data transfers
required.  This is actually pretty good.

I should have mentioned the stripe cache parameter before, but I use
raid "differently" and stripe cache does not impact my use case.
Sorry.

The 1GB/sec saturating a core is probably as good as it gets.  This
core is doing a lot of stripe cache page manipulations which are not
all that fast.

Also, the single-parity recovery case should be XOR and not the raid-6
logic, so it should be pretty cheap.  Another point, not important for
this issue, is that the benchmarks measure parity generation, not
recovery.  Recovery with raid-6 (i.e., two drives failed) is more
expensive than the writes.  I am not sure how optimized this is, but
it could be really bad.

If you need this to go faster, then it is either a raid re-design, or
perhaps you should consider cutting your array into two parts.  Two
12-drive raid-6 arrays will give you more bandwidth because the
failures are less "wide": a single failed drive then forces only 11
reads per reconstruction instead of 22.  Plus you get the benefit of
two raid-6 threads should you have dead drives on both halves.  You
can raid-0 the arrays together, roughly as sketched below.  Then
again, you lose two drives' worth of space.
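
A minimal sketch of that layout with mdadm (the device groupings and
the 512k chunk size are assumptions, not a tested recipe for your
hardware):

# two 12-drive raid-6 legs (10 data + 2 parity each)
mdadm --create /dev/md1 --level=6 --raid-devices=12 --chunk=512 /dev/sd[a-l]
mdadm --create /dev/md2 --level=6 --raid-devices=12 --chunk=512 /dev/sd[m-x]
# stripe the two legs together into a raid-50-style device
mdadm --create /dev/md10 --level=0 --raid-devices=2 --chunk=512 /dev/md1 /dev/md2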

Doug


On Tue, Aug 23, 2016 at 12:26 PM, Matt Garman <matthew.garman@xxxxxxxxx> wrote:
> Doug & Doug,
>
> Thank you for your helpful replies.  I merged both of your posts into
> one, see inline comments below:
>
> On Tue, Aug 23, 2016 at 2:10 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> Of course.  I didn't mean to imply otherwise.  The read size is the read
>> size.  But, since the OP's test case was to "read random files" and not
>> "read random blocks of random files", I took it to mean it would be
>> sequential IO across a multitude of random files.  That assumption might
>> have been wrong, but I wrote my explanation with that in mind.
>
> Yes, multiple parallel sequential reads.  Our test program generates a
> bunch of big random files (file size has an approximately normal
> distribution, centered around 500 MB, going down to 100 MB or so, up
> to a few multi-GB outliers).  The file generation is a one-time thing,
> and we don't really care about its performance.
>
> The read testing program just randomly picks one of those files, then
> reads it start-to-finish using "dd".  But it kicks off several "dd"
> threads at once (currently 50, though this is a run-time parameter).
> This is how we generate the read load, and I use iostat while this is
> running to see how much read throughput I'm getting from the array.
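>
> (For reference, the load generator is roughly equivalent to the
> sketch below; the path is illustrative, not our exact script:)
>
> #!/bin/bash
> # spawn 50 parallel sequential readers over randomly chosen test files
> for i in $(seq 50); do
>     ( while true; do
>           f=$(ls /data/testfiles | shuf -n 1)            # pick a random file
>           dd if="/data/testfiles/$f" of=/dev/null bs=1M  # read it start to finish
>       done ) &
> done
> wait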
>
>
> On Tue, Aug 23, 2016 at 1:00 PM, Doug Ledford <dledford@xxxxxxxxxx> wrote:
>> This depends a lot on how you structured your raid array.  I didn't see
>> your earlier emails, so I'm inferring from the "one out of 22 reads will
>> be to the bad drive" that you have a 24 disk raid6 array?  If so, then
>> that's 22 data disks and 2 parity disks per stripe.  I'm gonna use that
>> as the basis for my next statement even if it's slightly wrong.
>
> Yes, that is exactly correct, here's the relevant part of /proc/mdstat:
>
> Personalities : [raid1] [raid6] [raid5] [raid4]
>
> md0 : active raid6 sdl[11] sdi[8] sdx[23] sdc[2] sdo[14] sdn[13] sdm[12] sdr[17] sdk[10] sdb[1] sdu[20] sdp[15] sdq[16] sds[18] sdt[19] sdw[22] sdv[21] sda[0](F) sdj[9] sde[4] sdd[3] sdf[5] sdh[7] sdg[6]
>       44005879808 blocks super 1.2 level 6, 512k chunk, algorithm 2 [24/23] [_UUUUUUUUUUUUUUUUUUUUUUU]
>       bitmap: 0/15 pages [0KB], 65536KB chunk
>
>
>> Doug was right in that you will have to read 21 data disks and 1 parity
>> disk to reconstruct reads from the missing block of any given stripe.
>> And while he is also correct that this doubles IO ops needed to get your
>> read data, it doesn't address the XOR load to get your data.  With 19
>> data disks and 1 parity disk, and say a 64k chunk size, you have to XOR
>> 20 64k data blocks for 1 result.  If you are getting 200MB/s, you are
>> actually achieving more like 390MB/s of data read, with 190MB/s of it
>> being direct reads, and then you are using XOR on 200MB/s in order to
>> generate the other 10MB/s of results.
>
> Most of this morning I've been setting/unsetting/changing various
> tunables, to see if I could increase the read speed.  I got a huge
> boost by increasing the /sys/block/md0/md/stripe_cache_size parameter
> from the default (256 IIRC) to 16384.  Doubling it again to 32k didn't
> seem to bring any further benefit.  So with the stripe_cache_size
> increased to 16k, I'm now getting around 1000 MB/s read in the
> degraded state.  When the degraded array was only doing 200 MB/s, the
> md0_raid6 process was taking about 50% CPU according to top.  Now I
> have a 5x increase in read speed, and md0_raid6 is taking 100% CPU.
> I'm still degraded by a factor of eight, though, where I'd expect only
> two.
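>
> (The change itself is just a sysfs write; making it persistent across
> reboots would presumably need a udev rule or an rc.local entry:)
>
> # bump the raid5/6 stripe cache from the default 256 entries to 16384
> echo 16384 > /sys/block/md0/md/stripe_cache_size
> cat /sys/block/md0/md/stripe_cache_size    # verify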
>
>> 1) 200MB/s of XOR is not insignificant.  Due to our single-threaded XOR
>> routines, you can actually keep a CPU pretty busy with this.  Also, even
>> though the XOR routines try to time their assembly 'just so' so that
>> they can use the cache-avoiding instructions, this fails more often than
>> not, so you end up blowing CPU caches while doing this work, which of
>> course affects the overall system.
>
> While 200 MB/s of XOR sounds high, the kernel is "advertising" over
> 8000 MB/s, per dmesg:
>
> [    6.386820] xor: automatically using best checksumming function:
> [    6.396690]    avx       : 24064.000 MB/sec
> [    6.414706] raid6: sse2x1   gen()  7636 MB/s
> [    6.431725] raid6: sse2x2   gen()  3656 MB/s
> [    6.448742] raid6: sse2x4   gen()  3917 MB/s
> [    6.465753] raid6: avx2x1   gen()  5425 MB/s
> [    6.482766] raid6: avx2x2   gen()  7593 MB/s
> [    6.499773] raid6: avx2x4   gen()  8648 MB/s
> [    6.499773] raid6: using algorithm avx2x4 gen() (8648 MB/s)
> [    6.499774] raid6: using avx2x2 recovery algorithm
>
> (CPU is: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz.)
>
> I'm assuming that however the kernel does its testing, it's fairly
> optimal and probably assumes ideal cache behavior... so maybe actual
> XOR performance won't be as good as what dmesg suggests... but still,
> 200 MB/s (or even the 1000 MB/s I'm now getting) is much lower than
> 8000 MB/s...
>
> Is it possible to pin kernel threads to a CPU?  I'm thinking I could
> reboot with isolcpus=2 (for example) and if I can force that md0_raid6
> thread to run on CPU 2, at least the L1/L2 caches should be minimally
> affected...
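>
> Something like this is what I have in mind, assuming the md kernel
> thread accepts an affinity change via taskset (I haven't verified
> that):
>
> # boot with isolcpus=2 on the kernel command line, then:
> taskset -pc 2 $(pgrep -x md0_raid6)   # pin the raid6 thread to CPU 2
> taskset -p $(pgrep -x md0_raid6)      # confirm the new affinity mask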
>
>> Possible fixes for this might include:
>>         c) Creating a consolidated page cache/stripe cache (if we can read more
>> of the blocks needed to get our data from cache instead of disk it helps
>> reduce that IO ops issue)
>
> I suppose this might be an explanation for why increasing the array's
> stripe_cache_size gave me such a boost?
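>
> (If I have the accounting right -- each stripe cache entry holds one
> 4 KiB page per member device -- that setting costs roughly
> 16384 entries * 4 KiB * 24 drives = 1.5 GiB of RAM.)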
>
>>         d) Rearchitecting your arrays into raid50 instead of big raid6 array
>
> My colleague tested that exact same config with hardware raid5, and
> striped the three raid5 arrays together with software raid1.  So
> clearly not apples-to-apples, but he did get dramatically better
> degraded and rebuild performance.  I do intend to test a pure software
> raid-50 implementation.
>
>> (or conversely has the random head seeks just gone so
>> radically through the roof that the problem here really is the time it
>> takes the heads to travel everywhere we are sending them).
>
> I'm certain head movement time isn't the issue, as these are SSDs.  :)
>
> On Tue, Aug 23, 2016 at 1:27 PM, Doug Dumitru <doug@xxxxxxxxxx> wrote:
>> Can you run an iostat during your degraded test, and also a top run
>> over 20+ seconds with kernel threads showing up.  Even better would be
>> a perf capture, but you might not have all the tools installed.  You
>> can always try:
>>
>> perf record -a sleep 20
>>
>> then
>>
>> perf report
>>
>> should show you the top functions globally over the 20 second sample.
>> If you don't have perf loaded, you might (or might not) be able to
>> load it from the distro.
>
> Running top for 20 or more seconds, the top processes in terms of CPU
> usage are pretty static:
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  1228 root      20   0       0      0      0 R 100.0  0.0 562:16.83 md0_raid6
>  1315 root      20   0    4372    684    524 S  17.3  0.0  57:20.92 rngd
>   107 root      20   0       0      0      0 S   9.6  0.0  65:16.63 kswapd0
>   108 root      20   0       0      0      0 S   8.6  0.0  65:19.58 kswapd1
> 19424 root      20   0  108972   1676    560 D   3.3  0.0   0:00.52 dd
>  6909 root      20   0  108972   1676    560 D   2.7  0.0   0:01.53 dd
> 18383 root      20   0  108972   1680    560 D   2.7  0.0   0:00.63 dd
>
>
> I truncated the output.  The "dd" processes are part of our testing
> tool that generates the huge read load on the array.  Any given "dd"
> process might jump around, but those four kernel processes are always
> the top four.  (Note that before I increased the stripe_cache_size (as
> mentioned above), the md0_raid6 process was only consuming around 50%
> CPU.)
>
> Here is a representative view of a non-first iteration of "iostat -mxt 5":
>
>
> 08/23/2016 01:37:59 PM
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.84    0.00   27.41   67.59    0.00    0.17
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdy               0.00     0.40    0.80    0.60     0.05     0.00    83.43     0.00    1.00    0.50    1.67   1.00   0.14
> sdz               0.00     0.40    0.00    0.60     0.00     0.00    10.67     0.00    2.00    0.00    2.00   2.00   0.12
> sdd           12927.00     0.00  204.40    0.00    51.00     0.00   511.00     5.93   28.75   28.75    0.00   4.31  88.10
> sde           13002.60     0.00  205.20    0.00    51.20     0.00   511.00     6.29   30.39   30.39    0.00   4.59  94.12
> sdf           12976.80     0.00  205.00    0.00    51.00     0.00   509.50     6.17   29.76   29.76    0.00   4.57  93.78
> sdg           12950.20     0.00  205.60    0.00    50.80     0.00   506.03     6.20   29.75   29.75    0.00   4.57  93.88
> sdh           12949.00     0.00  207.20    0.00    50.90     0.00   503.11     6.36   30.35   30.35    0.00   4.59  95.10
> sdb           12196.40     0.00  192.60    0.00    48.10     0.00   511.47     5.48   28.15   28.15    0.00   4.38  84.36
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdi           12923.00     0.00  208.40    0.00    51.00     0.00   501.20     6.79   32.31   32.31    0.00   4.65  96.84
> sdj           12796.20     0.00  206.80    0.00    50.50     0.00   500.12     6.62   31.73   31.73    0.00   4.62  95.64
> sdk           12746.60     0.00  204.00    0.00    50.20     0.00   503.97     6.38   30.77   30.77    0.00   4.60  93.86
> sdl           12570.00     0.00  202.20    0.00    49.70     0.00   503.39     6.39   31.19   31.19    0.00   4.63  93.68
> sdn           12594.00     0.00  204.20    0.00    49.95     0.00   500.97     6.40   30.99   30.99    0.00   4.58  93.54
> sdm           12569.00     0.00  203.80    0.00    49.90     0.00   501.45     6.30   30.58   30.58    0.00   4.45  90.60
> sdp           12568.80     0.00  205.20    0.00    50.10     0.00   500.03     6.37   30.79   30.79    0.00   4.52  92.72
> sdo           12569.20     0.00  204.00    0.00    49.95     0.00   501.46     6.40   31.07   31.07    0.00   4.58  93.42
> sdw           12568.60     0.00  206.20    0.00    50.00     0.00   496.60     6.34   30.71   30.71    0.00   4.24  87.48
> sdx           12038.60     0.00  197.40    0.00    47.60     0.00   493.84     6.01   30.21   30.21    0.00   4.40  86.86
> sdq           12570.20     0.00  204.20    0.00    50.15     0.00   502.97     6.23   30.41   30.41    0.00   4.44  90.68
> sdr           12571.00     0.00  204.60    0.00    50.25     0.00   502.99     6.15   30.26   30.26    0.00   4.18  85.62
> sds           12495.20     0.00  203.80    0.00    49.95     0.00   501.95     6.00   29.62   29.62    0.00   4.24  86.38
> sdu           12695.60     0.00  207.80    0.00    50.65     0.00   499.17     6.22   30.00   30.00    0.00   4.16  86.38
> sdv           12619.00     0.00  207.80    0.00    50.35     0.00   496.22     6.23   30.03   30.03    0.00   4.20  87.32
> sdt           12671.20     0.00  206.20    0.00    50.50     0.00   501.56     6.05   29.30   29.30    0.00   4.24  87.44
> sdc           12851.60     0.00  203.00    0.00    50.70     0.00   511.50     5.84   28.49   28.49    0.00   4.17  84.64
> md126             0.00     0.00    0.60    1.00     0.05     0.00    71.00     0.00    0.00    0.00    0.00   0.00   0.00
> dm-0              0.00     0.00    0.60    0.80     0.05     0.00    81.14     0.00    2.29    0.67    3.50   1.14   0.16
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> md0               0.00     0.00 4475.20    0.00  1110.95     0.00   508.41     0.00    0.00    0.00    0.00   0.00   0.00
>
>
> sdy and sdz are the system drives, so they are uninteresting.
>
> sda is the md0 drive I failed, that's why it stays at zero.
>
> And lastly, here's the output of the perf commands you suggested (at
> least the top part):
>
> Samples: 561K of event 'cycles', Event count (approx.): 318536644203
> Overhead  Command         Shared Object                 Symbol
>   52.85%  swapper         [kernel.kallsyms]             [k] cpu_startup_entry
>    4.47%  md0_raid6       [kernel.kallsyms]             [k] memcpy
>    3.39%  dd              [kernel.kallsyms]             [k] __find_stripe
>    2.50%  md0_raid6       [kernel.kallsyms]             [k] analyse_stripe
>    2.43%  dd              [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    1.75%  rngd            rngd                          [.] 0x000000000000288b
>    1.74%  md0_raid6       [kernel.kallsyms]             [k] xor_avx_5
>    1.49%  dd              [kernel.kallsyms]             [k] copy_user_enhanced_fast_string
>    1.33%  md0_raid6       [kernel.kallsyms]             [k] ops_run_io
>    0.65%  dd              [kernel.kallsyms]             [k] raid5_compute_sector
>    0.60%  md0_raid6       [kernel.kallsyms]             [k] _raw_spin_lock_irq
>    0.55%  ps              libc-2.17.so                  [.] _IO_vfscanf
>    0.53%  ps              [kernel.kallsyms]             [k] vsnprintf
>    0.51%  ps              [kernel.kallsyms]             [k] format_decode
>    0.47%  ps              [kernel.kallsyms]             [k] number.isra.2
>    0.41%  md0_raid6       [kernel.kallsyms]             [k] raid_run_ops
>    0.40%  md0_raid6       [kernel.kallsyms]             [k] __blk_segment_map_sg
>
>
> That's my first time using the perf tool, so I need a little hand-holding here.
>
> Thanks again all!
> Matt



-- 
Doug Dumitru
EasyCo LLC


