Re: Performance Testing MD-RAID10 with 1 failed drive

Whether there is a performance hit or not depends on exactly how high
the IO load is.  If the fully redundant array is already running at the
IOPS limit of its devices, then any reads that suddenly have to be
serviced by a single device will overload the array.  Any IO that could
previously go to either side of the 2-disk mirror now has to be handled
by the one surviving disk, and that disk will be overloaded if the IO
load is too high.

For the most part the number of devices just increases the IO capacity
(raid-10 performs like a striped set of raid-1 pairs).
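
As a back-of-the-envelope illustration (my own sketch, nothing
measured; the per-disk IOPS figure below is a made-up assumption),
here is roughly how the read capacity of a 4-disk raid-10 changes when
one disk fails:

# Hypothetical capacity model for a 4-disk raid-10 (2 mirror pairs, striped).
# DEVICE_READ_IOPS is an assumed per-spindle number, not a measurement.
DEVICE_READ_IOPS = 150      # e.g. a 7200rpm disk doing random reads
MIRROR_PAIRS = 2            # 4 disks total

def read_capacity(pair_degraded=False):
    """Approximate random-read IOPS the whole array can service."""
    healthy_pairs = MIRROR_PAIRS - (1 if pair_degraded else 0)
    capacity = healthy_pairs * 2 * DEVICE_READ_IOPS
    if pair_degraded:
        capacity += DEVICE_READ_IOPS   # surviving half of the broken pair
    return capacity

print("healthy :", read_capacity(False), "read IOPS")  # 600
print("degraded:", read_capacity(True), "read IOPS")   # 450

The total only drops by a quarter, but any read that lands on the
broken pair has half the capacity it used to, and that is the pair
that overloads first.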

Benchmarking it requires knowing detail about the IO load.  IOPS gets
hard to reason about when you have, say, a write cache and 4k blocks
that get written and synced at, say, 100 bytes at a time (roughly 40
write IOs to that single block, which the write cache will merge).  If
your defined benchmark differs from your actual load, the results will
not be useful for guessing when the real load will break it.  If two
IOs land on the same disk track (sequential IO) and get merged
correctly, there is no expensive seek between them.  And nothing is
write-only: a lot of reads of the underlying filesystem data have to
happen for a write to complete (allocating blocks, bookkeeping, moving
blocks from the free list to the file being written), and those reads
will be hitting the 2-disk mirror that has a failed disk, where all
reads are now handled by a single device.
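
To make the merge arithmetic concrete (a toy calculation only, using
the 4k block / 100-byte sync numbers from above and assuming the cache
can coalesce back-to-back writes to the same block):

# Toy merge arithmetic: many small synced writes into one 4k block.
BLOCK_SIZE = 4096
WRITE_SIZE = 100                            # bytes written per sync

app_writes = -(-BLOCK_SIZE // WRITE_SIZE)   # ceil(4096/100) = 41 write calls
merged_device_ios = 1                       # cache coalesces them into one 4k write
unmerged_device_ios = app_writes            # every sync pushed straight to disk

print(app_writes, merged_device_ios, unmerged_device_ios)   # 41 1 41

Whether the device sees ~1 write or ~41 writes per block depends
entirely on how the cache and the sync pattern interact, which is
exactly why a benchmark that syncs differently from the real
application tells you very little.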

If you had the total IOPS and/or sar data (for the LVs and the md* and
sd* devices) from a few minutes when it was overloading, you could
probably see it.  Generally it is almost impossible to get a benchmark
"right" enough to be useful for telling you when the application will
overload the disk devices.

I troubleshoot a lot of DB IO load "issues".  Those DBs are all
running the same application code, but each has a slightly different
underlying workload, can look significantly different, and can
overload the underlying disk array in very different ways, depending
either on what the DB is doing wrong or on how the clients run their
queries and/or define their workflows.

The giveaway is watching the await times and the %util numbers.
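
If you want to watch those two numbers without iostat/sar handy, the
raw counters live in /proc/diskstats; a minimal sketch (the device
name and interval below are placeholders) that derives await and %util
the same way iostat does:

import time

DEV = "sda"        # placeholder; point it at your md*/sd*/dm-* device
INTERVAL = 5       # seconds between samples

def sample(dev):
    # /proc/diskstats fields after the device name
    # (Documentation/admin-guide/iostats.rst):
    # [0] reads completed, [3] ms spent reading, [4] writes completed,
    # [7] ms spent writing, [9] ms spent doing IO (io_ticks)
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == dev:
                v = list(map(int, parts[3:]))
                return v[0] + v[4], v[3] + v[7], v[9]
    raise ValueError(dev + " not found")

ios1, ms1, ticks1 = sample(DEV)
time.sleep(INTERVAL)
ios2, ms2, ticks2 = sample(DEV)

d_ios = ios2 - ios1
await_ms = (ms2 - ms1) / d_ios if d_ios else 0.0
util_pct = 100.0 * (ticks2 - ticks1) / (INTERVAL * 1000)
print("%s: await %.1f ms, util %.1f%%" % (DEV, await_ms, util_pct))

Run it against the md device and against each member sd* device and
the surviving disk of the degraded mirror should stand out
immediately.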

On Fri, Oct 21, 2022 at 10:30 AM Andy Smith <andy@xxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, Oct 21, 2022 at 06:51:41AM -0500, Roger Heflin wrote:
> > The original poster needs to get sar or iostat stats to see what the
> > actual IO rates are, but if they don't understand what the spinning
> > disk array can do fully redundant versus with a disk failed, it is
> > not unlikely that the IO load is higher than can be sustained with a
> > single disk failed.
>
> Though OP is using RAID-10 not RAID-1, and with more than 2 devices
> IIRC. OP wants to check the performance and I agree they should do
> that for both the normal case and the degraded case, but what are we
> expecting *in theory*? For RAID-10 on 4 devices we wouldn't expect
> much performance hit, would we? Since a read is striped across 2
> devices and there's a mirror of each, it'll read from the good
> half of the mirror for each read IO.
>
> --
> https://bitfolk.com/ -- No-nonsense VPS hosting


