On Mon, Mar 14, 2011 at 07:27:00AM -0500, Stan Hoeppner wrote:
> Is this only an issue with multi-chassis cabled NUMA systems such as
> Altix 4000/UV and the (discontinued) IBM x86 NUMA systems (x440/445)
> with their relatively low direct node-node bandwidth, or is this also
> of concern with single chassis systems with relatively much higher
> node-node bandwidth, such as the AMD Opteron systems, specifically the
> newer G34, which have node-node bandwidth of 19.2GB/s bidirectional?

Just do the math.  Buffered I/O does two memory copies - a copy_to_user
into the pagecache and a DMA from the pagecache to the device (yes,
that's also a copy as far as the memory subsystem is concerned, even if
the access comes from the device).  So to get 10GB/s of throughput you
spend 20GB/s on memory copies for the actual data alone.  Add to that
other system activity and metadata.

Whether you hit the interconnect or not depends on your memory
configuration, I/O attachment, and process locality.  If all the memory
the process uses and all the I/O sit on one node you won't hit the
interconnect at all, but depending on memory placement and storage
attachment you might hit it twice:

 - userspace memory on node A to pagecache on node B to device on
   node C (or A again, for that matter).

In short, you need to review your configuration pretty carefully.  With
direct I/O it's a lot easier, as you save a copy.
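
As a concrete illustration of the direct I/O path, here is a minimal
sketch of an O_DIRECT read (not from the original mail).  The
4096-byte alignment and 1 MiB read size are assumptions; the real
alignment requirement depends on the device's logical block size.

#define _GNU_SOURCE			/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	/*
	 * O_DIRECT bypasses the pagecache, so the device DMAs straight
	 * into the user buffer - one pass over memory instead of two.
	 */
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Buffer, file offset and length must be suitably aligned for
	 * O_DIRECT; 4096 bytes is an assumption, check the device's
	 * logical block size if in doubt.
	 */
	void *buf;
	size_t len = 1 << 20;			/* 1 MiB per read */
	int err = posix_memalign(&buf, 4096, len);
	if (err) {
		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
		return 1;
	}

	ssize_t n;
	while ((n = read(fd, buf, len)) > 0)
		;				/* consume the data here */
	if (n < 0)
		perror("read");

	free(buf);
	close(fd);
	return n < 0 ? 1 : 0;
}

On the placement side, wrapping the I/O process in something like
numactl --cpunodebind=N --membind=N is one way to keep the process and
its buffers on the node that owns the HBA; whether the pagecache pages
end up there too depends on the allocation path, so it is worth
verifying rather than assuming.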