Christoph Hellwig put forth on 3/18/2011 9:05 AM:

Thanks for the confirmations and explanations.

> The kernel is pretty smart in placement of user and page cache data, but
> it can't really second guess your intention. With the numactl tool you
> can help it doing the proper placement for you workload. Note that the
> choice isn't always trivial - a numa system tends to have memory on
> multiple nodes, so you'll either have to find a good partitioning of
> your workload or live with off-node references. I don't think
> partitioning NFS workloads is trivial, but then again I'm not a
> networking expert.

Bringing mdraid back into the fold, I'm wondering what kind of load the
mdraid threads would place on a system of the caliber needed to push
10GB/s NFS.

Neil, I spent quite a bit of time yesterday spec'ing out what I believe
is the bare minimum AMD64 based hardware needed to push 10GB/s NFS.
This includes:

  4 LSI 9285-8e 8port SAS 800MHz dual core PCIE x8 HBAs
  3 NIAGARA 32714 PCIe x8 Quad Port Fiber 10 Gigabit Server Adapters

This gives us 32 6Gb/s SAS ports and 12 10GbE ports total, for a raw
hardware bandwidth of 20GB/s SAS and 15GB/s ethernet.

I made the assumption that RAID 10 would be the only suitable RAID level
due to a few reasons:

1. The workload being 50+ NFS large file reads of aggregate 10GB/s,
   yielding a massive random IO workload at the disk head level.

2. We'll need 384 15k SAS drives to service a 10GB/s random IO load.

3. We'll need multiple "small" arrays enabling multiple mdraid threads,
   assuming a single 2.4GHz core isn't enough to handle something like
   48 or 96 mdraid disks.

4. Rebuild times for parity raid schemes would be unacceptably high and
   would eat all of the CPU the rebuild thread would run on.

To get the bandwidth we need and to make sure we don't run out of
controller chip IOPS, my calculations show we'd need 16 x 24 drive
mdraid 10 arrays.
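
As a back-of-envelope check of the figures above, here is a small Python
sketch. The per-port rates are assumptions used only for the conversion
(about 600 MB/s usable per 6Gb/s SAS lane after 8b/10b encoding, and the
raw 1250 MB/s line rate for 10GbE), and the names in the code are purely
illustrative.

```python
# Back-of-envelope check of the hardware figures quoted above.
# Assumptions (not measurements): ~600 MB/s usable per 6Gb/s SAS lane
# (8b/10b encoding) and the raw 1250 MB/s line rate per 10GbE port.

SAS_HBAS = 4
SAS_PORTS_PER_HBA = 8
SAS_MB_PER_PORT = 600        # assumed usable MB/s per 6Gb/s lane

NICS = 3
NIC_PORTS_PER_CARD = 4
ETH_MB_PER_PORT = 1250       # 10 Gb/s divided by 8 bits, raw line rate

ARRAYS = 16
DRIVES_PER_ARRAY = 24
TARGET_MB = 10_000           # 10GB/s aggregate NFS read target


def summarize():
    """Return port counts, raw bandwidth, and per-array/per-drive load."""
    sas_ports = SAS_HBAS * SAS_PORTS_PER_HBA
    eth_ports = NICS * NIC_PORTS_PER_CARD
    drives = ARRAYS * DRIVES_PER_ARRAY
    return {
        "sas_ports": sas_ports,
        "eth_ports": eth_ports,
        "sas_gb_per_s": sas_ports * SAS_MB_PER_PORT / 1000,
        "eth_gb_per_s": eth_ports * ETH_MB_PER_PORT / 1000,
        "drives": drives,
        # Assumes the load spreads evenly over arrays and spindles.
        "mb_per_s_per_array": TARGET_MB / ARRAYS,
        "mb_per_s_per_drive": TARGET_MB / drives,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:g}")
```

Under those assumptions the SAS side comes out at 19.2GB/s (rounded to
20GB/s above) and ethernet at 15GB/s. Each 24 drive array, and so each
mdraid thread, would need to sustain about 625 MB/s, or roughly 26 MB/s
per drive if reads spread evenly across all 384 spindles.
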
Thus, ignoring all other considerations momentarily, a dual AMD 6136
platform with 16 2.4GHz cores seems suitable, with one mdraid thread per
core, each managing a 24 drive RAID 10.

Would we then want to layer a --linear array across the 16 RAID 10
arrays? If we did this, would the linear thread bottleneck instantly as
it runs on only one core? How many additional memory copies
(interconnect transfers) are we going to be performing per mdraid thread
for each block read before the data is picked up by the nfsd kernel
threads?

How much of each core's cycles will we consume with normal random read
operations assuming 10GB/s of continuous aggregate throughput? Would
the mdraid threads consume sufficient cycles that, when combined with
network stack processing and interrupt processing, 16 cores at 2.4GHz
would be insufficient? If so, would bumping the two sockets up to 24
cores at 2.1GHz be enough for the total workload? Or, would we need to
move to a 4 socket system with 32 or 48 cores?

Is this possibly a situation where mdraid just isn't suitable due to the
CPU, memory, and interconnect bandwidth demands, making hardware RAID
the only real option? And if it does require hardware RAID, would it be
possible to stick 16 block devices together in a --linear mdraid array
and maintain the 10GB/s performance? Or, would the single --linear
array be processed by a single thread? If so, would a single 2.4GHz
core be able to handle an mdraid --linear thread managing 8 devices at
10GB/s aggregate?

Unfortunately I don't currently work in a position allowing me to test
such a system, and I certainly don't have the personal financial
resources to build it. My rough estimate on the hardware cost is
$150-200K USD. The 384 Hitachi 15k SAS 146GB drives at $250 each
wholesale are a little over $90k.

It would be really neat to have a job that allowed me to set up and test
such things.
:)

--
Stan