Hmm, I think you have everything you need to work with both md RAID and
hardware RAID, right? XFS allocation groups are nice; I don't know which
workloads they handle best, and maybe with RAID0/linear this works better
than striping (I must test it; see the P.S. at the bottom). I think you
know what you are doing =)  Any more doubts?

2011/3/21 Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx>:
> Keld Jørn Simonsen put forth on 3/20/2011 9:44 PM:
>
>> Are you then building the system yourself, and running Linux MD RAID?
>
> No.  These specifications meet the needs of Matt Garman's analysis
> cluster, and extend that performance from 6GB/s to 10GB/s.  Christoph's
> comments about 10GB/s throughput with XFS on large CPU count Altix 4000
> series machines from a few years ago prompted me to specify a single
> chassis multicore AMD Opteron based system that can achieve the same
> throughput at substantially lower cost.
>
>> Anyway, with 384 spindles and only 50 users, each user will have on
>> average 7 spindles to himself. I think much of the time this would mean
>> no random IO, as most users are doing large sequential reading.
>> Thus on average you can expect quite close to striping speed if you
>> are running RAID capable of striping.
>
> This is not how large scale shared RAID storage works under a
> multi-stream workload.  I thought I explained this in sufficient detail.
> Maybe not.
>
>> I am puzzled about the --linear concatenation. I think this may cause
>> the disks in the --linear array to be considered as one spindle, and
>> thus no concurrent IO will be made. I may be wrong there.
>
> You are puzzled because you are not familiar with the large scale
> performance features built into the XFS filesystem.  XFS allocation
> groups automatically enable large scale parallelism on a single logical
> device comprised of multiple arrays or single disks, when configured
> correctly.  See:
>
> http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Allocation_Groups.html
>
> The storage pool in my proposed 10GB/s NFS server system consists of 16
> RAID10 arrays, each made of 24 disks of 146GB capacity: 12 stripe
> spindles per array, 1.752TB per array, 28TB total.  Concatenating the
> 16 array devices with mdadm --linear creates a 28TB logical device.
> We format it with this simple command, not having to worry about stripe
> block size, stripe spindle width, stripe alignment, etc:
>
> ~# mkfs.xfs -d agcount=64
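
(To make the concatenation and format steps concrete, I imagine something
roughly like the following. This is only a sketch: I'm assuming the 16
RAID10 arrays already exist as /dev/md0 through /dev/md15, and /dev/md16,
/mnt/data and the inode64 mount option are my placeholders, not part of
Stan's specification.)

# concatenate the 16 RAID10 arrays end-to-end (no striping at this level)
~# mdadm --create /dev/md16 --level=linear --raid-devices=16 \
     /dev/md0  /dev/md1  /dev/md2  /dev/md3  /dev/md4  /dev/md5  /dev/md6  /dev/md7 \
     /dev/md8  /dev/md9  /dev/md10 /dev/md11 /dev/md12 /dev/md13 /dev/md14 /dev/md15

# 64 allocation groups spread over 16 arrays = 4 AGs per 12-spindle array
~# mkfs.xfs -d agcount=64 /dev/md16

# inode64 lets XFS place inodes and new data in every AG of a large filesystem
~# mount -o inode64 /dev/md16 /mnt/data

With that layout, concurrent streams land in different AGs, and therefore on
different underlying arrays, which, if I understand it right, is where the
parallelism Stan describes below comes from.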
> Using this method to achieve parallel scalability is simpler and less
> prone to configuration errors when compared to multi-level striping,
> which often leads to poor performance and poor space utilization.  With
> 64 XFS allocation groups the kernel can read/write 4 concurrent streams
> from/to each array of 12 spindles, which should be able to handle this
> load with plenty of headroom.  This system has 32 SAS 6G channels, each
> able to carry two 300MB/s streams, 19.2GB/s aggregate, substantially
> more than our 10GB/s target.  I was going to state that we're limited to
> 10.4GB/s due to the PCIe/HT bridge to the processor.  However, I just
> realized I made an error when specifying the DL585 G7 with only 2
> processors.  See [1] below for details.
>
> Using XFS in this manner allows us to avoid nested striped arrays and
> the inherent problems associated with them.  For example, in the absence
> of XFS allocation groups to get our parallelism, we could do the
> following:
>
> 1. Width 16 RAID0 stripe over width 12 RAID10 stripe
> 2. Width 16 LVM stripe over width 12 RAID10 stripe
>
> In either case, what is the correct/optimum stripe block size for each
> level when nesting the two?  The answer is that there really aren't
> correct or optimum stripe sizes in this scenario.  Writes to the top
> level stripe will be broken into 16 chunks.  Each of these 16 chunks
> will then be broken into 12 more chunks.  You may be thinking, "Why
> don't we just create one 384 disk RAID10?  It would SCREAM with 192
> spindles!!"  There are many reasons why nobody does this, one being the
> same stripe block size issue as with nested stripes.  Extremely wide
> arrays have a plethora of problems associated with them.
>
> In summary, concatenating many relatively low stripe spindle count
> arrays, and using XFS allocation groups to achieve parallel scalability,
> gives us the performance we want without the problems associated with
> other configurations.
>
>
> [1] In order to get all 11 PCIe slots in the DL585 G7 one must use the
> 4 socket model, as the additional PCIe slots of the mezzanine card
> connect to two additional SR5690 chips, each one connecting to an HT
> port on each of the two additional CPUs.  Thus, I'm re-specifying the
> DL585 G7 model to have 4 Opteron 6136 CPUs instead of two, 32 cores
> total.  The 128GB in 16 RDIMMs will be spread across all 16 memory
> channels.  Memory bandwidth thus doubles to 160GB/s and interconnect
> b/w doubles to 320GB/s.  Thus, we now have up to 19.2GB/s of available
> one way disk bandwidth, as we're no longer limited by a 10.4GB/s
> HT/PCIe link.  Adding the two required CPUs may have just made this
> system capable of 15GB/s NFS throughput for less than $5000 in
> additional cost, not due to the processors themselves, but the extra IO
> bandwidth enabled as a consequence of their inclusion.  Adding another
> quad port 10GbE NIC will take it close to 20GB/s NFS throughput.  Shame
> on me for not digging far deeper into the DL585 G7 docs.
>
> --
> Stan

-- 
Roberto Spadim
Spadim Technology / SPAEmpresarial
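
P.S. When I say I must test linear vs. stripe, something like the following
is what I have in mind. Just a sketch, with fio as one possible tool; the
mount point /mnt/test, job count and sizes are placeholders, not tuned for
Stan's box.

# 16 concurrent sequential readers, 1MB blocks, O_DIRECT; run the same job
# once on the linear concat + agcount=64 filesystem and once on a striped
# layout, then compare the aggregate bandwidth fio reports
~# fio --name=seqread --directory=/mnt/test --rw=read --bs=1M --size=4G \
      --numjobs=16 --direct=1 --ioengine=libaio --iodepth=4 \
      --runtime=60 --time_based --group_reporting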