Re: Performance Testing MD-RAID10 with 1 failed drive

Hello Roger, All,

Thanks for your response.

Yes, the scenario is when a drive has completely failed out of the
RAID10. I know it's not good practice to keep running an array with a
failed drive, and replacing the failed drive is always a priority for
us. What I am trying to understand is how to benchmark the
performance hit in that degraded condition.
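
What I had in mind is roughly the following, on a test box rather
than in production. This is only a rough sketch -- /dev/md0,
/dev/sdh, the test file path and the fio parameters are placeholders
for whatever matches our setup, and failing a member drops
redundancy, so it is for a scratch array only:

    # baseline run against the healthy array
    fio --name=healthy --filename=/mnt/md0/fio.test --size=20G \
        --direct=1 --rw=randrw --bs=4k --iodepth=32 --runtime=300 \
        --time_based --group_reporting

    # mark one member as failed, then repeat the exact same job
    mdadm /dev/md0 --fail /dev/sdh
    fio --name=degraded --filename=/mnt/md0/fio.test --size=20G \
        --direct=1 --rw=randrw --bs=4k --iodepth=32 --runtime=300 \
        --time_based --group_reporting

    # afterwards return the drive and let the array rebuild
    mdadm /dev/md0 --remove /dev/sdh
    mdadm /dev/md0 --add /dev/sdh

Comparing the two fio summaries should show the degraded-mode hit for
that particular IO pattern.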

To be specific about the type of workload these machines handle, they
run Kafka brokers.
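
To get some numbers on what the brokers actually do to the disks, I
was planning to capture something like the following while they are
busy (the interval and count are just examples):

    # per-device tps, await and %util, 10-second samples, 30 samples
    sar -d -p 10 30

    # the same picture from iostat, including the md device
    iostat -x 10 30

That should tell us whether the member disks are already near their
seek/IOPS limit even with all drives present.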

On Thu, Oct 20, 2022 at 4:54 AM Roger Heflin <rogerheflin@xxxxxxxxx> wrote:
>
> Is the  drive completely  failed out of the raid10?
>
> With a drive missing I would only expect read issues, but if the read
> load is high enough that it really needs both disks, then the writes
> would also get slower because the total IO (read+write load) is
> overloading the remaining disks.
>
> With 7200 rpm disks you can do a max of about 100-150 seeks/IOPS on
> each disk; any more than that and all IO on the disks will start to
> back up.  It will be worse if the application is writing sync to the
> disks (app guys love sync but fail to understand how it interacts
> with spinning disk hardware).
>
> sar -d will show the disks' tps (IOPS) and wait times (a 7200 rpm
> disk has a seek time of around 5-8 ms).  It will also show similar
> stats for the md device itself.  If the device is getting backed up,
> that means the app guys failed to understand what the hardware can
> do and what their application needs.
>
> On Wed, Oct 19, 2022 at 5:11 PM Wols Lists <antlists@xxxxxxxxxxxxxxx> wrote:
> >
> > On 19/10/2022 22:00, Reindl Harald wrote:
> > >
> > >
> > > Am 19.10.22 um 21:30 schrieb Umang Agarwalla:
> > >> Hello all,
> > >>
> > >> We run Linux RAID 10 in our production with 8 SAS HDDs 7200RPM.
> > >> We recently got to know from the application owners that the writes on
> > >> these machines get affected when there is one failed drive in this
> > >> RAID10 setup, but unfortunately we do not have much data around to
> > >> prove this and exactly replicate this in production.
> > >>
> > >> Wanted to know from the people of this mailing list if they have ever
> > >> come across any such issues.
> > >> Theoretically as per my understanding a RAID10 with even a failed
> > >> drive should be able to handle all the production traffic without any
> > >> issues. Please let me know if my understanding of this is correct or
> > >> not.
> > >
> > > "without any issue" is nonsense by common sense
> >
> > No need for the snark. And why shouldn't it be "without any issue"?
> > Common sense is usually mistaken. And common sense says to me the exact
> > opposite - with a drive missing that's one fewer write, so if anything
> > it should be quicker.
> >
> > Given that - on the system my brother was using - the ops guys didn't
> > notice their raid-6 was missing TWO drives, it seems like lost drives
> > aren't particularly noticeable by their absence ...
> >
> > Okay, with a drive missing it's DANGEROUS, but it should not have any
> > noticeable impact on a production system until you replace the drive and
> > it's rebuilding.
> >
> > Unfortunately, I don't know enough to say whether a missing drive would,
> > or should, impact performance.
> >
> > Cheers,
> > Wol


