Roberto Spadim put forth on 3/20/2011 12:32 AM:

> i think it's better contact ibm/dell/hp/compaq/texas/anyother and talk
> about the problem, post results here, this is a nice hardware question
> :)

I don't need vendor assistance to design a hardware system capable of
the 10GB/s NFS throughput target.  That's relatively easy.  I've already
specified one possible hardware combination capable of this level of
performance (see below).  The configuration will handle 10GB/s using the
RAID function of the LSI SAS HBAs.  The only question is whether it has
enough individual and aggregate CPU horsepower, memory, and HT
interconnect bandwidth to do the same using mdraid.  This is the reason
for my questions directed at Neil.

> don't tell about software raid, just the hardware to allow this
> bandwidth (10gb/s) and share files

I already posted some of the minimum hardware specs earlier in this
thread for the workload I described.  Following is a description of the
workload and a complete hardware specification.

Target workload:

10GB/s continuous parallel NFS throughput serving 50+ NFS clients whose
application performs large streaming reads.  At the storage array level
the 50+ parallel streaming reads become a random IO pattern workload
requiring a huge number of spindles due to the high seek rates.

Minimum hardware requirements, based on performance and cost.  Ballpark
guess on the total cost of the hardware below is $150-250k USD.  We
can't get the data to the clients without a network, so the
specification starts with the switching hardware needed.  (A quick
arithmetic check of the aggregate bandwidth figures follows the spec.)

Ethernet switches:

  One HP A5820X-24XG-SFP+ (JC102A)
    24 10GbE SFP+ ports
    488 Gb/s backplane switching capacity

  Five HP A5800-24G (JC100A)
    24 GbE ports, 4 10GbE SFP+ ports
    208 Gb/s backplane switching capacity

  Maximum common MTU (jumbo frames) enabled globally
  Connect the 12 server 10GbE ports to the A5820X
  Uplink 2 10GbE ports from each A5800 to the A5820X
  2 open 10GbE ports left on the A5820X for cluster expansion or off
    cluster data transfers to the main network
  Link aggregate the 12 server 10GbE ports to the A5820X
  Link aggregate each client's 2 GbE ports to the A5800s

  Aggregate client->switch bandwidth = 12.5 GB/s
  Aggregate server->switch bandwidth = 15.0 GB/s

  The excess server b/w of 2.5GB/s is a result of the following:

    Allowing headroom for an additional 10 clients or out of cluster
      data transfers
    Balancing the packet load over the 3 quad port 10GbE server NICs
      regardless of how many clients are active, to prevent hot spots
      in the server memory and interconnect subsystems

Server chassis:

  HP ProLiant DL585 G7 with the following specifications
    Dual AMD Opteron 6136, 16 cores @ 2.4GHz
    20GB/s node-node HT b/w, 160GB/s aggregate
    128GB DDR3-1333, 16 x 8GB RDIMMs in 8 channels
    20GB/s per node memory bandwidth, 80GB/s aggregate
    7 PCIe x8 slots and 4 PCIe x16 slots
    8GB/s per x8 slot, 56GB/s aggregate PCIe x8 bandwidth

IO controllers:

  4 x LSI SAS 9285-8e
    8 external SAS 6Gb/s ports, 800MHz dual core ROC, 1GB cache
  3 x Niagara 32714 PCIe x8 quad port fiber 10 Gigabit server adapters

JBOD enclosures:

  16 x LSI 620J, 2U, 24 x 2.5" SAS 6Gb/s bays, w/SAS expander
    2 SFF-8088 host ports and 1 expansion port per enclosure
    384 total SAS 6Gb/s 2.5" drive bays
    The enclosures are daisy chained in pairs, with one unit in each
    pair connecting to one of the 8 HBA SFF-8088 ports, for a total of
    32 6Gb/s SAS host connections, yielding 38.4GB/s of full duplex b/w

Disk drives:

  384 x Hitachi Ultrastar C15K147, 147GB, 15000 RPM, 64MB cache, 2.5"
    SAS 6Gb/s enterprise drives
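For those who want to check my math, here's a quick back of the
envelope calculation of the aggregate figures quoted above.  It only
uses the nominal per-link and per-drive rates from the spec (2 GbE per
client, 12 x 10GbE server ports, 32 x 6Gb/s SAS links at ~600MB/s each,
4 HBAs in PCIe 2.0 x8 slots at ~4GB/s per direction, 160MB/s peak
streaming per drive), so treat it as a sanity check, not a benchmark:

awk 'BEGIN {
    # nominal one-way rates in GB/s: GbE ~0.125, 10GbE ~1.25,
    # SAS 6Gb/s lane ~0.6, PCIe 2.0 x8 ~4.0, one drive ~0.160
    printf "client->switch : %5.1f GB/s\n", 50 * 2 * 0.125    # 50 clients x 2 GbE
    printf "server->switch : %5.1f GB/s\n", 12 * 1.25         # 12 x 10GbE ports
    printf "HBA<->disk     : %5.1f GB/s one way\n", 32 * 0.6  # 32 x 6Gb/s SAS links
    printf "HBA<->host     : %5.1f GB/s one way\n", 4 * 4.0   # 4 x PCIe 2.0 x8 HBAs
    printf "disk streaming : %5.1f GB/s\n", 384 * 0.160       # 384 drives x 160MB/s
}'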
Note that the HBA to disk bandwidths of 19.2GB/s one way and 38.4GB/s
full duplex exceed the corresponding HBA to PCIe bandwidths of 16GB/s
and 32GB/s by approximately 20%.  Also note that each drive can stream
reads at 160MB/s peak, yielding 61GB/s of aggregate streaming read
capacity for the 384 disks.  This is almost 4 times the aggregate one
way transfer rate of the 4 PCIe x8 slots, and 6 times our target host
to parallel client data rate of 10GB/s.  There are a few reasons why
this excess capacity is built into the system:

1.  RAID10 is the only suitable RAID level for this type of system with
    this many disks, for many reasons that have been discussed before.
    RAID10 instantly cuts the number of stripe spindles in half,
    dropping the data rate by a factor of 2 and giving us 30.5GB/s of
    potential aggregate throughput.  Now we're only at 3 times our
    target data rate.

2.  As a single disk drive's seek rate increases, its transfer rate
    drops relative to its single stream read performance.  Parallel
    streaming reads increase seek rates because the heads must move
    between different regions of the platters.

3.  In relation to 2, if we assume we'll lose no more than 66% of our
    single stream performance under a multi stream workload, we're down
    to 10.1GB/s of throughput, right at our target.

By using relatively small arrays of 24 drives each (12 stripe
spindles), concatenating the 16 resulting arrays (--linear), and laying
a filesystem such as XFS across the entire array, with its intelligent
load balancing of streams across allocation groups, we minimize disk
head seeking.  Doing this in essence divides our 50 client streams
across the 16 arrays, with each array seeing approximately 3 of the
streaming client reads.  Each disk should easily be able to maintain
33% of its max read rate while servicing 3 streaming reads.  (A rough
mdadm/mkfs.xfs command sketch of this layout is appended below my sig.)

I hope you found this informative or interesting.  I enjoyed the
exercise.  I'd been working on this system specification for quite a
few days, but was hesitant to post it due to its length and the fact
that, AFAIK, hardware discussion is a bit OT on this list.  I hope it
may be valuable to someone Google'ing for this type of information in
the future.

--
Stan
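Here's the rough mdadm/mkfs.xfs sketch I mentioned above.  It's
illustrative only: the drive list files are hypothetical, and the chunk
size, agcount, and mount options are placeholders rather than tuned
values.

# Assumes the 384 drives have already been sorted into 16 files named
# drives.0 .. drives.15 (hypothetical names), each holding the 24
# device paths for one array -- e.g. one enclosure's worth of drives,
# preferably as persistent /dev/disk/by-path names.

# 16 x 24-drive RAID10 arrays (12 stripe spindles each)
for i in $(seq 0 15); do
    mdadm --create /dev/md$i --level=10 --raid-devices=24 --chunk=256 \
        $(cat drives.$i)
done

# Concatenate the 16 arrays with a linear (non-striped) md device
mdadm --create /dev/md16 --level=linear --raid-devices=16 /dev/md{0..15}

# One XFS filesystem across the whole thing.  agcount=32 is a guess: a
# multiple of the 16 underlying arrays so the allocation groups map
# evenly onto them, while keeping each AG under XFS's 1TB size limit on
# a ~28TB volume.  Mount options are likewise illustrative.
mkfs.xfs -d agcount=32 /dev/md16
mount -o inode64,logbsize=256k /dev/md16 /export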