Re: high throughput storage server?

On Mon, Mar 21, 2011 at 09:18:57AM -0500, Stan Hoeppner wrote:
> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
> 
> > Are you then building the system yourself, and running Linux MD RAID?
> 
> No.  These specifications meet the needs of Matt Garman's analysis
> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
> series machines from a few years ago prompted me to specify a single
> chassis multicore AMD Opteron based system that can achieve the same
> throughput at substantially lower cost.

OK, but I understand that this is running Linux MD RAID, and not some
hardware RAID. True?

Or at least Linux MD RAID is used to build the --linear concatenation
underneath the FS. Then why not also use Linux MD to make the underlying
RAID1+0 arrays?
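Just so we are talking about the same layout, this is roughly how I
imagine a pure MD setup would look (device and array names are only
examples, and I have left out chunk sizes and the like):

# one of the 16 RAID10 arrays, 24 drives each
~# mdadm --create /dev/md0 --level=10 --raid-devices=24 /dev/sd[a-x]
# ... repeat for the remaining 15 arrays (md1 .. md15) ...
# concatenate the 16 arrays into one ~28TB logical device
~# mdadm --create /dev/md16 --level=linear --raid-devices=16 \
        /dev/md[0-9] /dev/md1[0-5]

That is, MD RAID10 for the mirroring/striping, and MD linear only for
the concatenation on top.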

> 
> > Anyway, with 384 spindles and only 50 users, each user will have on
> > average 7 spindles for himself. I think much of the time this would mean 
> > no random IO, as most users are doing large sequential reading. 
> > Thus on average you can expect quite close to striping speed if you
> > are running RAID capable of striping. 
> 
> This is not how large scale shared RAID storage works under a
> multi-stream workload.  I thought I explained this in sufficient detail.
>  Maybe not.

Given that the whole array system is only lightly loaded, this is how I
expect it to function. Maybe you can explain why it would not be so, if
you think otherwise.

> > I am puzzled about the --linear concatenating. I think this may cause
> > the disks in the --linear array to be considered as one spindle, and thus
> > no concurrent IO will be made. I may be wrong there.
> 
> You are puzzled because you are not familiar with the large scale
> performance features built into the XFS filesystem.  XFS allocation
> groups automatically enable large scale parallelism on a single logical
> device comprised of multiple arrays or single disks, when configured
> correctly.  See:
> 
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html
> 
> The storage pool in my proposed 10GB/s NFS server system consists of 16
> RAID10 arrays comprised of 24 disks of 146GB capacity, 12 stripe
> spindles per array, 1.752TB per array, 28TB total raw.  Concatenating
> the 16 array devices with mdadm --linear creates a 28TB logical device.
>  We format it with this simple command, not having to worry about stripe
> block size, stripe spindle width, stripe alignment, etc:
> 
> ~# mkfs.xfs -d agcount=64
> 
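(For completeness, I assume the command is run against the concatenated
MD device and the result then mounted, something like this, with
/dev/md16 and /data just being example names:

~# mkfs.xfs -d agcount=64 /dev/md16
~# mount /dev/md16 /data
~# xfs_info /data

mkfs.xfs and xfs_info both report the resulting agcount/agsize.)
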
> Using this method to achieve parallel scalability is simpler and less
> prone to configuration errors when compared to multi-level striping,
> which often leads to poor performance and poor space utilization.  With
> 64 XFS allocation groups the kernel can read/write 4 concurrent streams
> from/to each array of 12 spindles, which should be able to handle this
> load with plenty of headroom.  This system has 32 SAS 6G channels, each
> able to carry two 300MB/s streams, 19.2GB/s aggregate, substantially
> more than our 10GB/s target.  I was going to state that we're limited to
> 10.4GB/s due to the PCIe/HT bridge to the processor.  However, I just
> realized I made an error when specifying the DL585 G7 with only 2
> processors.  See [1] below for details.
> 
> Using XFS in this manner allows us to avoid nested striped arrays and
> the inherent problems associated with them.  For example, in absence of
> using XFS allocation groups to get our parallelism, we could do the
> following:
> 
> 1.  Width 16 RAID0 stripe over width 12 RAID10 stripe
> 2.  Width 16 LVM   stripe over width 12 RAID10 stripe
> 
> In either case, what is the correct/optimum stripe block size for each
> level when nesting the two?  The answer is that there really aren't
> correct or optimum stripe sizes in this scenario.  Writes to the top
> level stripe will be broken into 16 chunks.  Each of these 16 chunks
> will then be broken into 12 more chunks.  You may be thinking, "Why
> don't we just create one 384 disk RAID10?  It would SCREAM with 192
> spindles!!"  There are many reasons why nobody does this, one being the
> same stripe block size issue as with nested stripes.  Extremely wide
> arrays have a plethora of problems associated with them.
> 
> In summary, concatenating many relatively low stripe spindle count
> arrays, and using XFS allocation groups to achieve parallel scalability,
> gives us the performance we want without the problems associated with
> other configurations.
> 
> 
> [1]  In order to get all 11 PCIe slots in the DL585 G7 one must use the
> 4 socket model, as the additional PCIe slots of the mezzanine card
> connect to two additional SR5690 chips, each one connecting to an HT
> port on each of the two additional CPUs.  Thus, I'm re-specifying the
> DL585 G7 model to have 4 Opteron 6136 CPUs instead of two, 32 cores
> total.  The 128GB in 16 RDIMMs will be spread across all 16 memory
> channels.  Memory bandwidth thus doubles to 160GB/s and interconnect b/w
> doubles to 320GB/s.  Thus, we now have up to 19.2 GB/s of available one
> way disk bandwidth as we're no longer limited by a 10.4GB/s HT/PCIe
> link.  Adding the two required CPUs may have just made this system
> capable of 15GB/s NFS throughput for less than $5000 additional cost,
> not due to the processors, but the extra IO bandwidth enabled as a
> consequence of their inclusion.  Adding another quad port 10 GbE NIC
> will take it close to 20GB/s NFS throughput.  Shame on me for not
> digging far deeper into the DL585 G7 docs.

It is probably not the concurrency of XFS that provides the parallelism
of the IO. It is more likely the IO system, and that would also work for
other filesystem types, like ext4. I do not see anything in the XFS
allocation groups that carries any knowledge of the underlying disk
structure. What the filesystem does is only administer the scheduling of
the IO, in combination with the rest of the kernel.
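
That said, the numbers you give do line up: with 64 allocation groups on
the ~28TB concatenation, each AG is roughly a quarter of one 1.752TB
array, so every AG falls more or less inside a single array purely by
byte-offset arithmetic. A quick back-of-the-envelope check (my
arithmetic, not measured):

~# echo $(( (16 * 1752) / 64 ))   # GB per allocation group
438
~# echo $(( 1752 / 438 ))         # allocation groups per array
4

But that is a property of the chosen layout, not of XFS having any
knowledge of the disks underneath.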

Anyway, thanks for the energy and expertise that you are supplying to
this thread.

Best regards
keld