On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
> Is this only an issue with multi-chassis cabled NUMA systems such as
> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
> with their relatively low direct node-node bandwidth, or is this also
> of concern with single chassis systems with relatively much higher
> node-node bandwidth, such as the AMD Opteron systems, specifically the
> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?

Just do the math.  Buffered I/O does two memory copies - a copy_to_user
into the pagecache and a DMA from the pagecache to the device (yes,
that's also a copy as far as the memory subsystem is concerned, even if
the access comes from the device).  So to get 10GB/s of throughput you
spend 20GB/s on memory copies for the actual data alone.  Add to that
other system activity and metadata.

Whether you hit the interconnect or not depends on your memory
configuration, I/O attachment, and process locality.  If all the memory
the process uses and all the I/O sit on one node you won't hit the
interconnect at all, but depending on memory placement and storage
attachment you might hit it twice:

 - userspace memory on node A to pagecache on node B to device on
   node C (or A again, for that matter).

In short, you need to review your configuration pretty carefully.  With
direct I/O it's a lot easier, as you save a copy.
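
As a concrete illustration of the direct I/O path, here is a minimal
sketch of an O_DIRECT read (not from the original mail).  The
4096-byte alignment and 1 MiB read size are assumptions; the real
alignment requirement depends on the device's logical block size.

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	/*
	 * O_DIRECT bypasses the pagecache, so the device DMAs straight
	 * into the user buffer - one pass over memory instead of two.
	 */
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Buffer, file offset and length must be suitably aligned for
	 * O_DIRECT; 4096 bytes is an assumption, check the device's
	 * logical block size if in doubt.
	 */
	void *buf;
	size_t len = 1 << 20;			/* 1 MiB per read */
	int err = posix_memalign(&buf, 4096, len);
	if (err) {
		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
		return 1;
	}

	ssize_t n;
	while ((n = read(fd, buf, len)) > 0)
		;				/* consume the data here */
	if (n < 0)
		perror("read");

	free(buf);
	close(fd);
	return n < 0 ? 1 : 0;
}

On the placement side, wrapping the I/O process in something like
numactl --cpunodebind=N --membind=N is one way to keep the process and
its buffers on the node that owns the HBA; whether the pagecache pages
end up there too depends on the allocation path, so it is worth
verifying rather than assuming.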