On Tue, Dec 15, 2015 at 1:22 PM, Phil Turmel <philip@xxxxxxxxxx> wrote:
> Hi Dallas,
>
> On 12/15/2015 12:30 PM, Dallas Clement wrote:
>> Thanks guys for all the ideas and help.
>>
>> Phil,
>>
>>> Very interesting indeed.  I wonder if the extra I/O in flight at
>>> high depths is consuming all available stripe cache space, possibly
>>> not consistently.  I'd raise and lower that in various combinations
>>> with various combinations of iodepth.  Running out of stripe cache
>>> will cause premature RMWs.
>>
>> Okay, I'll play with that today.  I have to confess I'm not sure
>> that I completely understand how the stripe cache works.  I think
>> the idea is to batch I/Os into a complete stripe if possible and
>> write out to the disks all in one go to avoid RMWs.  Other than
>> alignment issues, I'm unclear on what triggers RMWs.  It seems, as
>> Robert mentioned, that if the I/O block size is stripe-aligned,
>> there should never be RMWs.
>>
>> My stripe cache is 8192 btw.
>
> The stripe cache is the kernel's workspace to compute parity or to
> recover data from parity.  It works on 4k blocks.  Per "man md", the
> units are the number of such blocks per device.  *The blocks in each
> cache stripe are separated from each other on disk by the chunk
> size.*
>
> Let's examine some scenarios for your 128k chunk size and 12 devices.
> You have 8192 cache stripes of 12 blocks each:
>
> 1) Random write of 16k.  4 stripes will be allocated from the cache,
> covering *all* of the devices, and filled for the devices written.
> The raid5 state machine lets them sit briefly for a chance for more
> writes to the other blocks in each stripe.
>
> 1a) If none come in, MD will request a read of the old data blocks
> and the old parities.  When those arrive, it'll compute the new
> parities and write both the parities and the new data blocks.  Total
> I/O: 32k read, 32k write.
>
> 1b) If other random writes come in for those stripes, chunk-size
> spaced, MD will wait a bit more.  Then it will read in any blocks
> that weren't written, compute parity, and write all the new data and
> parity.  Total I/O: 16k * n, possibly some reads, the rest writes.
>
> 2) Sequential write of a stripe-aligned 1408k.  The first 128k
> allocates 32 cache stripes and fills their first block.  The next
> 128k fills the second block of each cache stripe.  And so on, filling
> all the data blocks in the cache stripes.  MD shortly notices a full
> cache stripe write on each, so it just computes the parities and
> submits all of those writes.
>
> 3) Sequential write of 256k, aligned or not.  As above, but you only
> fill two blocks in each cache stripe.  MD then reads 1152k, computes
> parity, and writes 384k.
>
> 4) Multiple back-to-back writes of 1408k, aligned.  The first grabs
> 32 cache stripes and shortly queues all of those writes.  The next
> grabs another 32 cache stripes and queues more writes.  And then
> another 32 cache stripes and writes.  The underlying layer, as its
> queue grows, notices the adjacency of chunk writes from multiple
> top-level writes and starts merging.  Stripe caches are still held,
> though, until each write is completed.  If 256 top-level writes are
> in flight (8192/32), you've exhausted your stripe cache.  Note that
> this is writes in flight in your application *and* writes in flight
> from anything else.  Keep in mind that merging might actually raise
> the completion latency for the earlier writes.
>
> I'm sure you can come up with more.  The key is that stripe parity
> calculations must be performed on blocks separated on disk by the
> chunk size.  Really big chunk sizes don't actually help parity raid,
> since everything is broken down to 4k for the stripe cache, then
> re-merged underneath it.
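To make sure I'm following the arithmetic in 1a, 2 and 3, here's a
rough Python sketch of the block counts as I understand them.  It
assumes raid5 (one parity block per stripe) and a single chunk-aligned,
contiguous write of at most one full stripe; the rmw/rcw labels are
just my shorthand for the two possible read strategies, not MD's actual
state machine.

BLOCK_KIB = 4                       # the stripe cache works on 4k blocks

def write_io(devices, chunk_kib, write_kib):
    """Return (stripes, (rmw_read, write), (rcw_read, write)) in KiB."""
    data_disks = devices - 1
    dirty_blocks = write_kib // BLOCK_KIB
    # a contiguous write fills one chunk (chunk_kib/4 cache stripes) on
    # each device before moving on to the next device in the stripe
    stripes = min(dirty_blocks, chunk_kib // BLOCK_KIB)
    dirty_per_stripe = -(-dirty_blocks // stripes)          # ceil division

    writes    = stripes * (dirty_per_stripe + 1)            # new data + parity
    rmw_reads = stripes * (dirty_per_stripe + 1)            # old data + old parity
    rcw_reads = stripes * (data_disks - dirty_per_stripe)   # untouched data blocks

    kib = lambda blocks: blocks * BLOCK_KIB
    return stripes, (kib(rmw_reads), kib(writes)), (kib(rcw_reads), kib(writes))

# scenario 1a: 16k random write  -> 4 stripes, rmw: 32k read, 32k write
print(write_io(12, 128, 16))
# scenario 3: 256k sequential    -> 32 stripes, rcw: 1152k read, 384k write
print(write_io(12, 128, 256))
# scenario 2: full 1408k stripe  -> 32 stripes, rcw: 0k read, 1536k write
print(write_io(12, 128, 1408))

The numbers line up with your scenarios: 4 stripes and 32k/32k for the
16k random write, 1152k read and 384k written for the 256k sequential
write, and no reads at all for the full 1408k stripe.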
>> I wish this were for fun! ;)  Although this has been a fun
>> discussion.  I've learned a ton.  This effort is for work though.
>> I'd be all over the SSDs and caching otherwise.  I'm trying to
>> characterize and then squeeze all of the performance I can out of a
>> legacy NAS product.  I am constrained by the existing hardware.
>> Unfortunately I do not have the option of using SSDs or hardware
>> RAID controllers.  I have to rely completely on Linux RAID.
>>
>> I also need to optimize for large sequential writes (streaming
>> video, audio, large file transfers), iSCSI (mostly used for hosting
>> VMs), and random I/O (small and big files) as you would expect with
>> a NAS.
>
> On spinning rust, once you introduce any random writes, you've
> effectively made the entire stack a random workload.  This is true
> for all raid levels, but particularly true for parity raid due to the
> RMW cycles.  If you really need great sequential performance, you
> can't allow the VMs and the databases and small files on the same
> disks.
>
> That said, I recommend a parity raid chunk size of 16k or 32k for all
> workloads.  It greatly improves spatial locality for random writes,
> reduces stripe cache hogging for sequential writes, and doesn't hurt
> sequential reads too much.
>
> Phil

Wow!  Thanks a ton, Phil.  This is incredibly helpful!  It looks like
I need to do some experimenting with smaller chunk sizes.

Just one more question: what stripe cache size do you recommend for
this system?  It has 8 GB of RAM, but I can't use all of it for RAID,
as this NAS needs to run multiple applications.  I understand that in
the >= 4.1 kernels the stripe cache grows dynamically.
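For reference, here's the back-of-the-envelope math I'm using for the
cache's memory footprint, assuming the usual estimate of one 4k page
per member device per cache entry (and our 12-disk array):

PAGE_KIB = 4    # one page per member device per cache entry

def stripe_cache_mib(entries, devices=12):
    # rough footprint only; the real stripe_head has some extra overhead
    return entries * PAGE_KIB * devices / 1024

for entries in (256, 1024, 4096, 8192, 32768):
    print(f"stripe_cache_size={entries:>6} -> ~{stripe_cache_mib(entries):6.0f} MiB")

# 8192 with 12 disks is ~384 MiB; even 32768 is ~1.5 GiB of the 8 GB.

So RAM itself doesn't look like the constraint; I mostly want to avoid
starving the applications while still having enough entries to dodge
the premature RMWs you described.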