On 1/30/2014 9:28 AM, Matt Garman wrote:

> On Thu, Jan 30, 2014 at 4:22 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
>> I wouldn't go used as they do.  Not for something this critical.
>
> No, not for an actual production system.  I linked that as "conceptual
> inspiration", not as an exact template for what I'd do.  Although, the
> used route might be useful for building a cheap prototype to
> demonstrate proof of concept.
>
>> If you architect the system correctly, and use decent quality hardware,
>> it won't blow up on you.  If you don't get the OS environment tuned
>> correctly you'll simply get less throughput than desired.  But that can
>> always be remedied with tweaking.
>
> Right.  I think the general concept is solid, but, as with most
> things, "the devil's in the details".

Always.

> FWIW, the creator of the DCDW enumerated some of the "gotchas" for a
> build like this [1].  He went into more detail in some private
> correspondence with me.  It's a little alarming that he got roughly
> 50% of the performance with a tuned Linux setup compared to a mostly
> out-of-the-box Solaris install.

Most x86-64 Linux distro kernels are built to perform on servers,
desktops, and laptops alike, so performance on each is somewhat
compromised.  Solaris x86-64 is built primarily for server duty and is
tuned for that out of the box.  So what you state above isn't too
surprising.

> Also, subtle latency issues with PCIe timings across different
> motherboards sound like a migraine-caliber headache.

This is an issue of board design and QA, specifically trace routing and
the resulting signal skew, and there's nothing the buyer can do about
it.  Unfortunately this kind of information just isn't "out there" in
reviews and whatnot when you buy boards.  The best one can do is buy a
reputable brand and cross fingers.

>> Each of the two backplanes has 24 drive slots and 6 SFF-8087 connectors.
>> Each 8087 carries 4 SAS channels.  You connect two ports of each HBA to
>> the top backplane and the other two to the bottom backplane.  I.e. one
>> ...
>
> Your concept is similar to what I've sketched out in my mind.  My
> twist is that I think I would actually build multiple servers, each
> one a 24-disk 2U system.  Our data is fairly easy to partition across
> multiple servers.  Also, we already have a big "symlink index"
> directory that abstracts the actual location of the files.  IOW, my
> users don't know/don't care where the files actually live, as long as
> the symlinks are there and not broken.

That makes tuning each box much easier if you go with a single 10GbE
port.  But this has some downsides I'll address further down.
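A quick aside on the backplane wiring quoted above, since one dedicated
lane per drive slot is what keeps the disk end of this design free of
bottlenecks.  Here's a rough sanity check; the lane and SSD rates are my
assumptions (SAS2 at 6Gb/s, ~600MB/s per lane after 8b/10b, ~500MB/s
sustained per SATA SSD), so treat it as a sketch, not a spec sheet:

    # Per-backplane sanity check of the direct-attach wiring quoted above
    connectors_per_backplane = 6   # SFF-8087 connectors on a 24-slot backplane
    lanes_per_8087 = 4             # each 8087 carries 4 SAS channels
    lane_MBps = 600                # assumed: SAS2 6Gb/s lane after 8b/10b
    drives = 24
    ssd_MBps = 500                 # assumed: sustained read per SATA SSD

    lanes = connectors_per_backplane * lanes_per_8087
    print(lanes)                                           # 24 -> one lane per slot
    print(lanes * lane_MBps / 1000, "GB/s of SAS lanes")   # 14.4 GB/s per backplane
    print(drives * ssd_MBps / 1000, "GB/s of SSD reads")   # 12.0 GB/s per backplane

With no expanders in the path, the SSDs, not the SAS plumbing, set the
ceiling.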
>> Without the cost of NICs you're looking at roughly $19,000 for this
>> configuration, including shipping costs, for a ~22TB DIY SSD based NFS
>> server system expandable to 46TB.  With two quad port 10GbE NICs and
>> SFPs you're at less than $25K, with the potential for ~6GB/s NFS
>> throughput.
>
> Yup, and this amount is less than one year's maintenance on the big
> iron system we have in place.  And, quoting the vendor, "Maintenance
> costs only go up."

Yes, it's sad.  "Maintenance contract" = mostly pure profit.  It's the
Best Buy extended warranty of the big iron marketplace.  You pay a ton
of money and get very little, if anything, in return.

>> In specifying HBAs instead of RAID controllers I am assuming you'll
>> use md/RAID.  With this many SSDs any current RAID controller would
>> slow you down anyway, as the ASICs aren't fast enough.  You'll need
>> minimum redundancy to guard against an SSD failure, which means RAID5
>> with SSDs.  Your workload is almost exclusively read heavy, which
>> means you could simply create a single 24 drive RAID5 or RAID6 with
>> the default 512KB chunk.  I'd go with RAID6.  That will yield a
>> stripe width of 22*512KB=11MB.  Using RAID5/6 allows you to grow the
>> array incrementally without the need for LVM, which may slow you down.
>
> At the expense of storage capacity, I was thinking of raid10 with
> 3-way mirrors.  We do have backups, but downtime on this system won't
> be taken lightly.

I was in lock step with you until this point.  We're talking about SSDs,
aren't we?  And a read-only workload?  RAID10 today is only for
transactional workloads on rust, to avoid read-modify-write (RMW).  SSD
doesn't suffer RMW latency, and this isn't a transactional workload but
parallel linear read.  Three-way mirroring in a RAID10 setup exists
strictly to avoid losing the 2nd disk in a mirror while its partner is
rebuilding.  That is suitable for large rusty drives where rebuild times
are 8+ hours.  With a RAID10 triple mirror setup 2/3rds of your capacity
is wasted.  This isn't a sane architecture for SSDs and a read-only
workload.  Here's why.  Under optimal conditions:

  a 4TB 7.2K SAS/SATA mirror rebuild takes  4TB / 130MB/s = ~8.5 hours
  a 1TB Sammy 840 EVO mirror rebuild takes  1TB / 500MB/s = ~34 minutes

A RAID6 rebuild will take a little longer, but still much less than an
hour, say 45 minutes max.  With RAID6 you would have to sustain *two*
additional drive failures within that 45 minute rebuild window to lose
the array.  Only an HBA, backplane, or PSU failure could take down two
more drives in 45 minutes, and if that happens you're losing many
drives, probably all of them, and you're sunk anyway.

No matter how you slice it, I can't see RAID10 being of any benefit
here, and especially not 3-way mirror RAID10.  If one of your concerns
is decreased client throughput during a rebuild, then simply turn down
the rebuild priority to 50%.  Your rebuild will take 1.5 hours, during
which you'd have to lose 2 additional drives to lose the array, and
you'll still have more client throughput at the array than the network
interface can push:

  (22 * 500MB/s = 11GB/s) / 2  =  5.5GB/s client B/W during rebuild
  10GbE interface B/W          =  1.0GB/s max

Using RAID10 yields no gain but increases cost.  RAID10 with 3-way
mirrors is simply 3 times the cost with 2/3rds of the capacity wasted.
Any form of mirroring just isn't suitable for this type of SSD system.
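If it helps to see the rebuild-window arithmetic above in one place,
here's a trivial sketch.  The 130MB/s and 500MB/s figures are the same
rough sustained rates used above, not measured numbers:

    # Rebuild windows at full speed, and client bandwidth left during a
    # RAID6 rebuild throttled to ~50% (assumed rates, not measurements)
    def rebuild_time(capacity_gb, rate_mb_s):
        return capacity_gb * 1000 / rate_mb_s / 3600   # hours

    print(round(rebuild_time(4000, 130), 1), "h   4TB 7.2K rust mirror")   # ~8.5 h
    print(round(rebuild_time(1000, 500) * 60), "min 1TB 840 EVO mirror")   # ~33 min

    data_drives = 22                        # 24 drive RAID6, 2 drives of parity
    array_gb_s = data_drives * 500 / 1000   # ~11 GB/s aggregate read
    print(round(array_gb_s / 2, 1), "GB/s left for clients during 50% rebuild")  # ~5.5
    print(1.0, "GB/s max through a single 10GbE port")

Even throttled to half speed the array can feed clients several times
what one 10GbE port can carry.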
>> Surely you'll use XFS as it's the only Linux filesystem suitable for
>> such a parallel workload.  As you will certainly grow the array in the
>> future, I'd format XFS without stripe alignment and have it do 4KB IOs.
>> ...
>
> I was definitely thinking XFS.  But one other motivation for multiple
> 2U systems (instead of one massive system) is that it's more modular.

The modular approach has advantages.  But keep in mind that modularity
increases complexity and component count, which increases the
probability of a failure.  The more vehicles you own, the more often one
of them is in the shop at any given time, even if only for an oil
change.

> Existing systems never have to be grown or reconfigured.  When we need
> more space/throughput, I just throw another system in place.  I might
> have to re-distribute the data, but this would be a very rare (maybe
> once/year) event.

Gluster has advantages here, as it can redistribute data automatically
among the storage nodes.  If you do distributed mirroring you can take a
node completely offline for maintenance and clients won't skip a beat,
or at worst a short beat.  It costs half your storage for the mirroring,
but on top of RAID6 you still come out roughly a third ahead of RAID10
with 3-way mirrors in usable capacity.

> If I get the green light to do this, I'd actually test a few
> configurations.  But some that come to mind:
>
> - raid10,f3

Skip it.  RAID10 is a no-go here.  And none of the alternate layouts
will provide any benefit, because SSDs are not spinning rust: the
alternate layouts exist strictly to reduce rotational latency.

> - groups of 3-way raid1 mirrors striped together with XFS

I covered this above.  Skip it.  And you're thinking of XFS over
concatenated mirror sets here.  That architecture is used only for high
IOPS transactional workloads on rust.  It won't gain you anything with
SSDs.

> - groups of raid6 sets not striped together (our symlink index I
>   mentioned above makes this not as messy as it sounds)

If you're going with multiple identical 24 bay nodes, you want a single
24 drive md/RAID6 in each, directly formatted with XFS, or Gluster atop
XFS.  It's the best approach for your read-only workload with large
files.
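To put rough numbers on the layout options above, here's the usable
capacity per 24 x 1TB node.  The Gluster line assumes plain replica-2
mirroring across nodes on top of a local RAID6 (my assumption), so it's
back-of-envelope only:

    # Usable capacity per 24 x 1TB SSD node under the layouts discussed
    drives, size_tb = 24, 1
    raw = drives * size_tb

    layouts = {
        "RAID10, 3-way mirrors":        raw / 3,        # 2/3 of capacity wasted
        "RAID10, 2-way mirrors":        raw / 2,
        "single 24-drive RAID6":        raw - 2,        # 2 drives of parity
        "Gluster replica 2 atop RAID6": (raw - 2) / 2,  # assumed replica-2 across nodes
    }

    for name, usable in layouts.items():
        wasted = 100 * (1 - usable / raw)
        print(f"{name:30s} {usable:5.1f} TB usable, {wasted:3.0f}% sacrificed")

RAID6 keeps 22 of 24 TB, 3-way RAID10 keeps 8, and replica-2 Gluster
over RAID6 works out to 11 per node, which is where the "roughly a third
ahead" above comes from.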
>> The last point I'll make is that it may require some serious tweaking
>> of IRQ load balancing, md/RAID, NFS, Ethernet bonding driver, etc, to
>> wring peak throughput out of such a DIY SSD system.  Achieving ~1GB/s
>> parallel NFS throughput from a DIY rig with a single 10GbE port isn't
>> horribly difficult.  3+GB/s parallel NFS via bonded 10GbE interfaces
>> is a bit more challenging.
>
> I agree, I think that comes back around to what we said above: the
> concept is simple, but the details mean the difference between
> brilliant and mediocre.

The details definitely become a bit easier with one array and one NIC
per node.  But one thing really bothers me about such a setup.  You have
~11GB/s of read throughput from 22 data SSDs (24 in RAID6).  It doesn't
make sense to waste ~10GB/s of that by using a single 10GbE interface.
At the very least you should be using 4x 10GbE ports per box to achieve
a potential 3+ GB/s.

I think what's happening here is that you're saving so much money
compared to the proprietary NAS filer that you're intoxicated by the
savings.  You're throwing money around on SSDs like a drunken sailor on
a 6 month leave at a strip club :) without fully understanding the
implications of the capability you're buying and putting in each box.

> Thanks for your input Stan, I appreciate it.  I'm an infrequent poster
> to this list, but a long-time reader, and I've learned a lot from your
> posts over the years.

Glad someone actually reads my drivel on occasion. :)

I'm firmly an AMD guy.  I used the YMI 48 bay Intel server in my
previous example for expediency, and to avoid what I'm doing here now.
Please allow me to indulge you with a complete parts list for one fully
DIY NFS server node build.  I've matched and verified compatibility of
all of the components, using manufacturer specs, down to the iPASS/SGPIO
SAS cables.  Combined with the LSI HBAs and this SM backplane, these
sideband signaling SAS cables should enable you to make drive failure
LEDs work with mdadm, using:

http://sourceforge.net/projects/ledmon/

I've not tried the software myself, but if it's up to par, dead drive
identification should work the same as with any vendor storage array,
which to this point has been nearly impossible with md arrays on plain
non-RAID HBAs.

Preemptively flashing the mobo and SAS HBAs with the latest firmware
images should prevent any issues with the hardware.  These products ship
with "shelf" firmware that is often quite a few revs old by the time the
customer receives the product.

All but one of the necessary parts are stocked by Newegg, believe it or
not.  The build consists of:

- a 24 bay 2U SuperMicro chassis with 920W dual hot swap PSUs
- a SuperMicro dual socket C32 mobo w/5 PCIe 2.0 x8 slots
- 2x Opteron 4334 3.1GHz 6 core CPUs
- 2x Dynatron C32/1207 2U CPU coolers
- 8x Kingston 4GB ECC registered DDR3-1333 single rank DIMMs
- 3x LSI 9207-8i PCIe 3.0 x8 SAS HBAs
- a rear 2 drive hot swap cage
- 2x Samsung 120GB boot SSDs
- 24x Samsung 1TB data SSDs
- 6x 2ft LSI SFF-8087 sideband cables
- 2x dual port Intel 10GbE NICs, sans SFPs, as you probably already have
  some spares.  You may prefer another NIC brand/model; these are <$800
  of the total.

 1x http://www.newegg.com/Product/Product.aspx?Item=N82E16811152565
 1x http://www.newegg.com/Product/Product.aspx?Item=N82E16813182320
 2x http://www.newegg.com/Product/Product.aspx?Item=N82E16819113321
 2x http://www.newegg.com/Product/Product.aspx?Item=N82E16835114139
 8x http://www.newegg.com/Product/Product.aspx?Item=N82E16820239618
 3x http://www.newegg.com/Product/Product.aspx?Item=N82E16816118182
24x http://www.newegg.com/Product/Product.aspx?Item=N82E16820147251
 2x http://www.newegg.com/Product/Product.aspx?Item=N82E16820147247
 6x http://www.newegg.com/Product/Product.aspx?Item=N82E16812652015
 2x http://www.newegg.com/Product/Product.aspx?Item=N82E16833106044
 1x http://www.costcentral.com/proddetail/Supermicro_Storage_drive_cage/MCP220826090N/11744345/

Total cost today: $16,927.23
SSD cost:         $13,119.98

Note all SSDs are directly connected to the HBAs.  This system doesn't
suffer any disk bandwidth starvation due to SAS expanders, as most
storage arrays do.  As such you get nearly full bandwidth per drive,
limited only by the northbridge to CPU HT link.  At the hardware level
the system bandwidth breakdown is as follows:

Memory:         42.6 GB/s
PCIe to CPU:    10.4 GB/s unidirectional x2
HBA to PCIe:    12   GB/s unidirectional x2
SSD to HBA:     12   GB/s unidirectional x2
PCIe to NIC:     8   GB/s unidirectional x2
NIC to client:   4   GB/s unidirectional x2

Your HBA traffic will flow on the HT uplink and your NIC traffic on the
downlink, so you're not constrained there with this read-only NFS
workload.

Assume an 8:1 ratio between file bytes requested and memory bandwidth
consumed by the kernel, in the form of DMA transfers to/from the buffer
cache, memory-to-memory copies for TCP and NFS, and hardware overhead
such as coherency traffic between the CPUs and HBAs and interrupts in
the form of MSI-X writes to memory.  Then 4GB/s of requested data
generates ~32GB/s at the memory controllers before it ever goes over the
wire.  Beyond tweaking parameters, it may require building a custom
kernel to achieve this throughput.  But the hardware is capable.

Using a single 10GbE interface yields about 1/10th of the SSD bandwidth
to clients.  That's a huge waste of the $$ spent on the SSDs.  Using 4
will come close to maxing out the rest of the hardware, so I spec'd 4
ports.  With the correct bonding setup you should be able to get
3-4GB/s, still only 1/4th to 1/3rd of the SSD throughput.  To get close
to the full 12GB/s of read bandwidth offered by these 24 SSDs requires a
box with dual Socket G34 processors for 8 DDR3-1333 memory channels
(85GB/s), two SR5690 PCIe-to-HT controllers, and 8x 10GbE ports (or 2x
QDR InfiniBand 4x).
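Here's the memory side of that as a quick calc.  The 8:1 amplification
factor is the same rough assumption as above, not a measured number:

    # Memory bandwidth headroom for the dual C32 build, under the 8:1
    # request-to-memory-traffic assumption above (rough estimate only)
    channels = 4                 # 2 sockets x 2 DDR3 channels each
    per_channel_gb_s = 10.66     # DDR3-1333, per channel
    mem_bw = channels * per_channel_gb_s          # ~42.6 GB/s

    nfs_gb_s = 4.0               # 4x 10GbE worth of reads to clients
    amplification = 8            # DMA + TCP/NFS copies + coherency + MSI-X
    needed = nfs_gb_s * amplification             # ~32 GB/s

    print(round(mem_bw, 1), "GB/s available")
    print(round(needed, 1), "GB/s needed at 4 GB/s of NFS reads")

So the four DDR3-1333 channels leave some headroom at 4GB/s of NFS
reads, while dropping to the lowest DIMM clock would not (more on that
below).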
Notice I didn't discuss CPU frequency or core count anywhere?  That's
because neither is the limiting factor.  The critical factor is memory
bandwidth.  Any single/dual Opteron 4xxx/6xxx system with ~8 or more
cores should do the job, as long as IRQs are balanced across the cores.

Hope you at least found this an interesting read, if not actionable.
Maybe others will as well.  I had some fun putting this one together.
I think the only things I omitted were Velcro straps and self-stick lock
loops for tidying up the cables for optimum airflow.  Experienced
builders usually have these on hand, but I figured I'd mention them just
in case.

Locating certified DIMMs of the required clock speed and rank took too
much time, but this was not unforeseen.  The only easy way to spec
memory for server boards and still allow maximum expansion is to go with
the lowest clock speed.  If I'd done that here you'd lose 17GB/s of
memory bandwidth, a 40% reduction.  I also wanted to use a single socket
G34 board, but unfortunately nobody makes one with more than 3 PCIe
slots, and this design required at least 4.

-- 
Stan

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html