Re: kernel checksumming performance vs actual raid device performance

Matt,

One last thing I would highly recommend is:

Secure erase the replacement disk before rebuilding onto it.

If the replacement disk is "pre-conditioned" with random writes, even
writes issued very slowly, its write performance will be lower during
the rebuild.
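
Something along these lines should do it, assuming the controller exposes
the SSDs as plain /dev/sdX devices (the device name below is a placeholder,
so triple-check you are pointing at the replacement disk and not a live
array member):

  # full-device discard, quick if the drive/controller pass TRIM through
  blkdiscard /dev/sdX

  # or a real ATA secure erase via hdparm; the drive must not be "frozen"
  # (check the security section of "hdparm -I /dev/sdX" first)
  hdparm --user-master u --security-set-pass p /dev/sdX
  hdparm --user-master u --security-erase p /dev/sdX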

On Tue, Aug 16, 2016 at 12:44 PM, Matt Garman <matthew.garman@xxxxxxxxx> wrote:
> Hi Doug & linux-raid list,
>
> On Tue, Jul 12, 2016 at 9:10 PM, Doug Dumitru <doug@xxxxxxxxxx> wrote:
>> You might want to try running "perf" on your system while it is degraded and
>> see where the thread is churning.  I would love to see those results.  I
>> would not be surprised to see that the thread is literally "spinning".  If
>> so, then the 100% CPU is probably fixable, but it won't actually help
>> performance.
>
> I sat on your email for a while, as the machine in question was (is)
> production, and we don't have any useful downtime windows to
> experiment.  But now we have a second, identical machine.  It
> eventually needs to go into production as well, but for now we have
> some time to test.
>
> My understanding of "perf" is that it analyzes an individual process.
> Would you be willing to elaborate on how I might use it while the
> rebuild is taking place?
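
perf is not limited to a single process -- it can sample the whole system,
kernel threads included.  Something like the following, run while the array
is degraded and rebuilding, should show where the time is going (the thread
names assume an array called md0; check ps for yours):

  # sample all CPUs for 30 seconds with call graphs, then inspect the report;
  # the md0_raid6 and md0_resync kernel threads will show up by name
  perf record -a -g -- sleep 30
  perf report

  # or watch live
  perf top -g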
>
>> In terms of single-drive-missing performance with short reads, you are
>> mostly at the mercy of short read IOPS.  If your array is reading 8K blocks
>> at 2 GB/sec, that is 250,000 IOPS, and if you kill off a drive, it will jump
>> to 500,000 IOPS.  Reads from the good drives remain single reads, but
>> reads from the missing drive require reads from all of the others (with
>> raid-5, all but one).  I am not sure how the recovery thread issues these
>> recovery reads.  Hopefully, it blasts them at the array with abandon (i.e.,
>> submits all 22 requests concurrently), but the code might be less aggressive
>> in deference to hard disks.  SSDs love deep queue depths.
>
> I may be jumping ahead a little, but I wonder if there are tuning
> parameters that make sense for an array such as this, given the
> read-dominant (effectively WORM) workload?  In particular, things like
> block-level read-ahead, IO scheduler, queue depth, etc.  I know the
> standard answer for these is "test and see" but we don't have a second
> 100-machine compute farm to test with.  It's quite hard to simulate
> such a workload.
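
There is no magic setting, but these are the knobs I would measure first.
The values are starting points to experiment with, not recommendations, and
the device names are placeholders (this assumes the SSDs appear as plain
sdX devices and the array is md0):

  # per-disk I/O scheduler and request queue depth
  cat /sys/block/sdX/queue/scheduler
  echo noop > /sys/block/sdX/queue/scheduler
  echo 1024 > /sys/block/sdX/queue/nr_requests

  # read-ahead on the md device (value is in 512-byte sectors)
  blockdev --setra 4096 /dev/md0

  # raid5/6 stripe cache size (entries per device; larger values cost RAM)
  echo 8192 > /sys/block/md0/md/stripe_cache_size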
>
>> 1)  Consider a single-CPU-socket solution, like an E5-1650 v3.  Multi-socket
>> CPUs introduce NUMA and a whole slew of "interesting" system contention
>> issues.
>
> I think that's a good idea, but I wanted to have two identical systems.
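
If you do stay with two sockets, it is at least worth checking which NUMA
node each HBA and the NIC hang off of, so the IRQ affinity work further down
can keep the interrupt handling local (the PCI address is a placeholder):

  numactl --hardware
  cat /sys/bus/pci/devices/0000:03:00.0/numa_node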
>
>> 2)  Use good HBAs that are direct-connected to the disks.  I like the LSI
>> 3008 and the newer 16-port version, although you should use only 12 ports
>> with 6 Gbit SATA/SAS to keep from over-running the PCIe slot bandwidth.
>
> We have three LSI MegaRAID SAS-3 3108 9361-8i controllers per system.
> 8 ports per card.  Drives are indeed direct-connected.  (Technically
> there is a backplane, but it's not an expander, just a pass-through
> backplane for neat cabling.)
>
>> 3)  Do everything you can to hammer deep queue depths.
>
> Can you elaborate on that?
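
What I mean is keeping a lot of I/O outstanding against the SSDs at all
times: many concurrent readers/nfsd threads rather than a few, and
block-layer queues deep enough not to throttle them.  A quick way to see
what the array can do at a deep queue depth is a read-only fio run (the md
device name is a placeholder; randread writes nothing, but be careful all
the same):

  fio --name=qdtest --filename=/dev/md0 --direct=1 --rw=randread \
      --bs=8k --ioengine=libaio --iodepth=64 --numjobs=8 \
      --runtime=60 --group_reporting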
>
>> 4)  Setup IRQ affinity so that the HBAs spread their IRQ requests across
>> cores.
>
> We have spent a lot of time tuning the NIC IRQs, but have not yet
> spent any time on the HBA IRQs.  Will do.
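
The mechanics are the same as for the NIC: find the controller's vectors in
/proc/interrupts and pin them across cores (or leave it to irqbalance).  The
IRQ number and CPU range below are placeholders, and the exact label depends
on how the megaraid_sas driver registers its interrupts:

  grep -iE 'megasas|megaraid' /proc/interrupts
  echo 2-5 > /proc/irq/123/smp_affinity_list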
>
>> You can probably mitigate the amount of degradation by lowering the rebuild
>> speed, but this will make the rebuild take longer, so you are messed up
>> either way.  If the server has "down time" at night, you might lower the
>> rebuild speed to a really small value during the day, and raise it at night.
>
> I'll have to discuss with my colleagues, but we have the impression
> that the max rebuild speed parameter is more of a hint than an actual
> "hard" setting.  That is, we tried to do exactly what you suggest:
> defer most rebuild work to after-hours when the load was lighter (and
> no one would notice).  But we were unable to keep the rebuild from
> crippling NFS performance during the day.
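
For reference, these are the knobs involved (values are in KB/s and just
examples; md0 is a placeholder).  Note that speed_limit_min is the rate md
tries to sustain even when there is competing I/O, so it is worth checking
that one as well as the max:

  # system-wide floor and ceiling for resync/rebuild
  sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
  echo 5000  > /proc/sys/dev/raid/speed_limit_min
  echo 20000 > /proc/sys/dev/raid/speed_limit_max

  # per-array override and current rate
  echo 20000 > /sys/block/md0/md/sync_speed_max
  cat /sys/block/md0/md/sync_speed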
>
> "Messed up either way" is indeed the right conclusion here.  But I
> think we have a bottleneck somewhere that is artificially hurting
> performance, making things worse than they could or should be.
>
> Thanks again for the thoughtful feedback!
>
> -Matt



-- 
Doug Dumitru
EasyCo LLC