Re: high throughput storage server?

Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:

> Are you then building the system yourself, and running Linux MD RAID?

No.  These specifications meet the needs of Matt Garman's analysis
cluster and extend the target throughput from 6GB/s to 10GB/s.
Christoph's comments about 10GB/s throughput with XFS on large CPU
count Altix 4000 series machines a few years ago prompted me to specify
a single-chassis, multicore AMD Opteron based system that can achieve
the same throughput at substantially lower cost.

> Anyway, with 384 spindles and only 50 users, each user will have in
> average 7 spindles for himself. I think much of the time this would mean 
> no random IO, as most users are doing large sequential reading. 
> Thus on average you can expect quite close to striping speed if you
> are running RAID capable of striping. 

This is not how large scale shared RAID storage works under a
multi-stream workload.  I thought I explained this in sufficient
detail.  Maybe not.

> I am puzzled about the --linear concatenating. I think this may cause
> the disks in the --linear array to be considered as one spindle, and thus
> no concurrent IO will be made. I may be wrong there.

You are puzzled because you are not familiar with the large scale
performance features built into the XFS filesystem.  When configured
correctly, XFS allocation groups automatically provide large scale
parallelism on a single logical device built from multiple arrays or
single disks.  See:

http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html

The storage pool in my proposed 10GB/s NFS server consists of 16 RAID10
arrays, each built from 24 disks of 146GB capacity: 12 stripe spindles
per array, 1.752TB usable per array, ~28TB usable total (56TB raw).
Concatenating the 16 array devices with mdadm --linear creates a single
28TB logical device.  We format it with this simple command, without
having to worry about stripe block size, stripe spindle width, stripe
alignment, etc:

~# mkfs.xfs -d agcount=64
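
For anyone who wants the whole sequence in one place, here is a minimal
sketch, assuming the 16 RAID10 array devices show up as /dev/sd[b-q].
Those device names, the /dev/md0 target, the mount point, and the
inode64 option are illustrative assumptions, not part of the actual
build spec:

~# mdadm --create /dev/md0 --level=linear --raid-devices=16 /dev/sd[b-q]
~# mkfs.xfs -d agcount=64 /dev/md0     # 64 AGs over 16 arrays = 4 AGs per array
~# mount -o inode64 /dev/md0 /export   # inode64 lets inodes be allocated in all 64 AGs

mkfs.xfs lays the 64 equal-sized allocation groups linearly across the
concatenated device, so each of the 16 arrays ends up holding exactly 4
of them.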

Using this method to achieve parallel scalability is simpler and less
prone to configuration errors than multi-level striping, which often
leads to poor performance and poor space utilization.  With 64 XFS
allocation groups the kernel can read/write 4 concurrent streams
from/to each array of 12 spindles, which should handle this load with
plenty of headroom.  This system has 32 SAS 6G channels, each able to
carry two 300MB/s streams, for 19.2GB/s aggregate (32 x 2 x 300MB/s),
substantially more than our 10GB/s target.  I was going to state that
we're limited to 10.4GB/s by the PCIe/HT bridge to the processor.
However, I just realized I made an error in specifying the DL585 G7
with only 2 processors.  See [1] below for details.

Using XFS in this manner allows us to avoid nested striped arrays and
the problems inherent to them.  For example, in the absence of XFS
allocation groups to provide our parallelism, we could do either of the
following:

1.  Width 16 RAID0 stripe over width 12 RAID10 stripe
2.  Width 16 LVM   stripe over width 12 RAID10 stripe

In either case, what is the correct/optimum stripe block size for each
level when nesting the two?  The answer is that there really isn't a
correct or optimum stripe size in this scenario.  Writes to the top
level stripe will be broken into 16 chunks, and each of those 16 chunks
will then be broken into 12 more chunks.  You may be thinking, "Why
don't we just create one 384-disk RAID10?  It would SCREAM with 192
spindles!!"  There are many reasons why nobody does this, one being the
same stripe block size problem as with nested stripes.  Extremely wide
arrays have a plethora of problems associated with them.
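
To make the stripe block size question concrete, option 2 above would
look something like the following.  This is purely illustrative, not a
recommendation; the device names, volume group name, and the 64KB
stripe size are arbitrary assumptions:

~# pvcreate /dev/sd[b-q]                                   # 16 hardware RAID10 arrays as PVs
~# vgcreate vg_store /dev/sd[b-q]
~# lvcreate -i 16 -I 64 -l 100%FREE -n lv_store vg_store   # 16-wide LVM stripe, 64KB stripe size
~# mkfs.xfs /dev/vg_store/lv_store

Every write is now split into 64KB pieces across the 16 arrays by LVM,
and each piece must then be mapped onto the 12-spindle stripe of the
RAID10 array beneath it, which is exactly the two-level chunk size
question raised above.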

In summary: concatenating many arrays with relatively low stripe
spindle counts, and using XFS allocation groups to achieve parallel
scalability, gives us the performance we want without the problems of
the other configurations.


[1]  In order to get all 11 PCIe slots in the DL585 G7, one must use
the 4-socket model: the additional PCIe slots on the mezzanine card
connect to two additional SR5690 chips, each of which connects to an HT
port on one of the two additional CPUs.  I'm therefore re-specifying
the DL585 G7 with 4 Opteron 6136 CPUs instead of 2, 32 cores total.
The 128GB in 16 RDIMMs will be spread across all 16 memory channels.
Memory bandwidth thus doubles to 160GB/s and interconnect bandwidth
doubles to 320GB/s, and we now have up to 19.2GB/s of available one-way
disk bandwidth, as we're no longer limited by a 10.4GB/s HT/PCIe link.
Adding the two required CPUs may have just made this system capable of
15GB/s NFS throughput for less than $5000 of additional cost, not
because of the processors themselves, but because of the extra IO
bandwidth their inclusion enables.  Adding another quad port 10 GbE NIC
will take it close to 20GB/s NFS throughput.  Shame on me for not
digging far deeper into the DL585 G7 docs.

-- 
Stan


