Re: raid5 to utilize up to 8 cores

On 8/16/2012 2:52 AM, David Brown wrote:
> On 16/08/2012 07:58, Stan Hoeppner wrote:
>> On 8/15/2012 9:56 PM, vincent Ferrer wrote:
>>
>>> - My storage server has up to 8 cores running Linux kernel 2.6.32.27.
>>> - I created a raid5 device of 10 SSDs.
>>> - It seems I only have a single raid5 kernel thread, limiting my
>>> WRITE throughput to a single CPU core/thread.
>>
>> The single write threads of md/RAID5/6/10 are being addressed by patches
>> in development.  Read the list archives for progress/status.  There were
>> 3 posts to the list today regarding the RAID5 patch.
>>
>>> Question: What are my options to make my raid5 thread use all the
>>> CPU cores?
>>> My SSDs can do much more, but the single raid5 thread
>>> from mdadm is becoming the bottleneck.
>>>
>>> To overcome the above single-thread raid5 limitation (for now) I
>>> re-configured:
>>>       1)  I partitioned each of my 10 SSDs into 8 partitions.
>>>       2)  I created 8 raid5 arrays, each with its own thread, each
>>> array built from one partition on every one of the 10 SSDs.
>>>       3)  My WRITE performance quadrupled because I have 8 RAID5
>>> threads.
>>> Question: Is this workaround normal practice, or may it give me
>>> maintenance problems later on?
>>
>> No, it is not normal practice.  I 'preach' against it regularly when I
>> see OPs doing it.  It's quite insane.  The glaring maintenance problem
>> is that when one SSD fails, and at least one will, you'll have 8 arrays
>> to rebuild vs one.  This may be acceptable to you, but not to the
>> general population.  With rust drives, and real workloads, it tends to
>> hammer the drive heads prodigiously, increasing latency and killing
>> performance, and decreasing drive life.  That's not an issue with SSD,
>> but multiple rebuilds are.  That, and simply keeping track of 80
>> partitions.
>>
> 
> The rebuilds will, I believe, be done sequentially rather than in
> parallel.  And each rebuild will take 1/8 of the time a full array
> rebuild would have done.  So it really should not be much more time or
> wear-and-tear for a rebuild of this monster setup, compared to a single
> raid5 array rebuild.  (With hard disks, it would be worse due to head
> seeks - but still not as bad as you imply, if I am right about the
> rebuilds being done sequentially.)
> 
> However, there was a recent thread here about someone with a similar
> setup (on hard disks) who had a failure during such a rebuild and had
> lots of trouble.  That makes me sceptical of this sort of multiple-array
> setup (in addition to Stan's other points).
> 
> And of course, all Stan's other points about maintenance, updates to
> later kernels with multiple raid5 threads, etc., still stand.
> 
>> There are a couple of sane things you can do today to address your
>> problem:
>>
>> 1.  Create a RAID50, a layered md/RAID0 over two 5-SSD md/RAID5 arrays.
>>   This will double your threads and your IOPS.  It won't be as fast as
>> your Frankenstein setup and you'll lose one SSD of capacity to
>> additional parity.  However, it's sane, stable, doubles your
>> performance, and you have only one array to rebuild after an SSD
>> failure.  Any filesystem will work well with it, including XFS if
>> aligned properly.  It gives you an easy upgrade path-- as soon as the
>> threaded patches hit, a simple kernel upgrade will give your two RAID5
>> arrays the extra threads, so you're simply out one SSD of capacity.  You
>> won't need to, and probably won't want to, rebuild the entire thing after
>> the patch.  With the Frankenstein setup you'll be destroying and
>> rebuilding arrays.  And if these are consumer grade SSDs, you're much
>> better off having two drives worth of redundancy anyway, so a RAID50
>> makes good sense all around.
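
For reference, a rough sketch of creating such a RAID50 with mdadm; the
device names (/dev/sda through /dev/sdj) and the su/sw values are
assumptions for illustration, not settings from the original message:

    # two 5-SSD md/RAID5 arrays (device names assumed)
    ~$ mdadm --create /dev/md1 --level=5 --raid-devices=5 /dev/sd[a-e]
    ~$ mdadm --create /dev/md2 --level=5 --raid-devices=5 /dev/sd[f-j]
    # stripe the two RAID5 arrays together to form the RAID50
    ~$ mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/md1 /dev/md2
    # align XFS to the stripe geometry; su/sw below are placeholders and
    # must be matched to the actual chunk size and data-spindle count
    ~$ mkfs.xfs -d su=512k,sw=8 /dev/md0
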
>>
>> 2.  Make 5 md/RAID1 mirrors and concatenate them with md/RAID linear.
>> You'll get one md write thread per RAID1 device utilizing 5 cores in
>> parallel.  The linear driver doesn't use threads, but passes offsets to
>> the block layer, allowing infinite core scaling.  Format the linear
>> device with XFS and mount with inode64.  XFS has been fully threaded for
>> 15 years.  Its allocation group design along with the inode64 allocator
>> allows near linear parallel scaling across a concatenated device[1],
>> assuming your workload/directory layout is designed for parallel file
>> throughput.
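
A minimal sketch of this layout, assuming the ten SSDs appear as
/dev/sda through /dev/sdj (device and array names are illustrative,
not from the original message):

    # five 2-SSD md/RAID1 mirrors, each with its own md write thread
    ~$ mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda /dev/sdb
    ~$ mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
    ~$ mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sde /dev/sdf
    ~$ mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdg /dev/sdh
    ~$ mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sdi /dev/sdj
    # concatenate the mirrors; linear has no stripe geometry to align to
    ~$ mdadm --create /dev/md0 --level=linear --raid-devices=5 \
         /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    # format and mount with the inode64 allocator
    ~$ mkfs.xfs /dev/md0
    ~$ mount -o inode64 /dev/md0 /mnt/data
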
>>
>> #2, with a parallel write workload, may be competitive with your
>> Frankenstein setup in both IOPS and throughput, even with 3 fewer RAID
>> threads and 4 fewer SSD "spindles".  It will outrun the RAID50 setup
>> like it's standing still.  You'll lose half your capacity to redundancy
>> as with RAID10, but you'll have 5 write threads for md/RAID1, one per
>> SSD pair.  One core should be plenty to drive a single SSD mirror, with
>> plenty of cycles to spare for actual applications, while sparing 3 cores
>> for apps as well.  You'll get unlimited core scaling with both md/linear
>> and XFS.  This setup will yield the best balance of IOPS and throughput
>> performance for the amount of cycles burned on IO, compared to
>> Frankenstein and the RAID50.
> 
> For those who don't want to use XFS, or won't have balanced directories
> in their filesystem, or want greater throughput of larger files (rather
> than greater average throughput of multiple parallel accesses), you can
> also take your 5 raid1 mirror pairs and combine them with raid0.  You
> should get similar scaling (the CPU does not limit raid0).  For some
> applications (such as a mail server, /home mount, etc.), XFS over a
> linear concatenation is probably unbeatable.  But for others (such as
> serving large media files), a raid0 over raid1 pairs could well be
> better.  As always, it depends on your load - and you need to test with
> realistic loads or at least realistic simulations.
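
A sketch of that striped variant, reusing the illustrative mirror names
from the concat sketch above (again assumptions, not tested commands):

    ~$ mdadm --create /dev/md0 --level=0 --raid-devices=5 \
         /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5
    # if XFS is used on the stripe, pass its geometry at mkfs time;
    # su/sw here are placeholders (RAID0 chunk size, number of mirrors)
    ~$ mkfs.xfs -d su=512k,sw=5 /dev/md0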

Sure, a homemade RAID10 would work as it avoids the md/RAID10 single
write thread.  I intentionally avoided mentioning this option for a few
reasons:

1.  Anyone needing 10 SATA SSDs obviously has a parallel workload.
2.  Any thread will have up to 200-500MB/s available (one SSD)
    with a concat; I can't see a single thread needing 4.5GB/s of B/W.
    If one did, md/RAID couldn't deliver it, not on COTS hardware.
3.  With a parallel workload requiring this many SSDs, XFS is a must.
4.  With a concat, mkfs.xfs is simple, no stripe aligning, etc.:
    ~$ mkfs.xfs /dev/md0

-- 
Stan
