Re: RAID performance

On 11/02/13 04:19, Mikael Abrahamsson wrote:
> On Mon, 11 Feb 2013, Adam Goryachev wrote:
> 
>> Nope, I'm saying that on 5 different (specifically machines 1, 4, 5,
>> 6, 7) physical boxes, (the xen host) if I do a dd
>> if=/dev/disk/by-path/iscsivm1 of=/dev/null on 5 machines concurrently,
>> then they only get 20Mbps each. If I do one at a time, I get 130Mbps,
>> if I do two at a time, I get 60Mbps, etc... If I do the same test on
>> machines 1, 2, 3, 8 at the same time, each gets 130Mbps
> 
> When you say Mbps, I read that as Megabit/s. Are you in fact referring
> to megabyte/s?

Oops, my mistake, yes, I meant MB/s for those results, since that is the
unit dd reports.

> I suspect the load balancing (hashing) function on the switch terminating
> the LAG is causing your problem. Typically this hashing function doesn't
> look at load on individual links, but a specific src/dst/port hash
> points to a certain link, and there isn't really anything you can do
> about it. The only way around it is to go 10GE instead of the LAG, or
> move away from the LAG and assign 4 different IPs, one per physical
> link, and then make sure routing to/from server/client always goes onto
> the same link, cutting worst-case down to two servers sharing one link
> (8 servers, 4 links).

Given the flat topology, I think it is difficult (though not impossible)
to ensure that both inbound and outbound traffic is sent/received on the
correct interface. Since the route to any of the 8 destinations is on the
same network, Linux would (AFAIK) choose the lowest-numbered interface for
all outbound traffic. Getting the right outbound interface is the first
issue; once that is solved, ensuring that each interface only sends ARP
replies for its own IP is the second. Both of these are solvable, roughly
along the lines of the commands below...
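
For the record, this is the sort of thing I have in mind: policy routing
so replies leave via the interface that owns the source IP, plus the ARP
sysctls so each NIC only answers for its own address. The interface names
and 192.168.1.x addresses are made up, and I haven't tested this on the
SAN yet:

# Per-interface routing tables so traffic sourced from each IP leaves
# via its own NIC (eth0/eth1 and the addresses are placeholders)
ip route add 192.168.1.0/24 dev eth0 src 192.168.1.10 table 10
ip rule add from 192.168.1.10 table 10
ip route add 192.168.1.0/24 dev eth1 src 192.168.1.11 table 11
ip rule add from 192.168.1.11 table 11
# ...and repeat for eth2/eth3...

# Only answer ARP for the IP actually configured on the receiving NIC,
# and prefer that NIC's own address when sending ARP
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2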

However, this adds a lot of complexity, and this system is also supposed
to let heartbeat automatically move the 'floating' IP to the secondary
server on failure, which complicates things further. It'd be nice to
avoid all of that, but if it is what's needed, then I'll address it.

>> The problem is that (from my understanding) LACP will balance the
>> traffic based on the destination MAC address, by default. So the
>> bandwidth between any two machines is limited to a single 1Gbps link.
>> So regardless of the number of ethernet ports on the DC box, it will
>> only ever use a max of 1Gbps to talk to the iSCSI server.
> 
> LACP is a way to set up a bunch of ports in a channel. It doesn't affect
> how traffic will be shared, that is a property of the hardware/software
> mix in the switch/operating system (LACP is control plane, it's not
> forwarding plane). The device egressing the packet onto a link decides
> which port it goes out of, typically based on properties of L2, L3 and
> L4 (different for different devices).
> 
>> However, if I configure Linux to use xmit_hash_policy=1 it will use
>> the IP address and port (layer 3+4) to decide which trunk to use. It
>> will still only use 1Gbps to talk to that IP:port combination.
> 
> As expected. You do not want to send packets belonging to a single
> "session" out different ports, because then you might get packet
> reordering. This is called "per-packet load sharing"; if it's desirable
> then it might be possible to enable it in the equipment. TCP doesn't like
> it though, don't know how storage protocols react.

Hmmm, so from my reading, it seems that out-of-order packets will never
be received by the SAN, since the sender only has 1Gbps and the switch
will only deliver that data over one port anyway.

However, the clients (8 physical machines) would certainly receive
out-of-order packets, since the SAN would be sending over 4 x 1Gbps while
the switch can only deliver to a single 1Gbps client port; it would also
probably drop packets when its queues fill up, and all of that would slow
everything down.

I see a kernel option net.ipv4.tcp_reordering; would setting it to a
higher value allow me to use round-robin (balance-rr) for the bonded
connections, even if the server has more total bandwidth than the
recipient? Roughly as sketched below.
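
To be concrete, this is the sort of untested configuration I'm imagining
on the SAN side (bond0, the slave names and the value 127 are just
examples, not my actual setup):

# Round-robin bonding instead of relying on the LACP hash
# (bond0 and the slave names are placeholders)
modprobe bonding mode=balance-rr miimon=100
ip link set bond0 up
ifenslave bond0 eth0 eth1 eth2 eth3

# Let TCP tolerate more out-of-order segments before treating them as
# loss (the default is 3; 127 is just an arbitrary higher example)
sysctl -w net.ipv4.tcp_reordering=127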

If I use a 10G connection for the SAN and multiple 1G connections for the
clients, then I will still end up with a max of 1G read speed per client,
since the switch will still deliver any one flow over a single 1G port.
So to get better than 1G I must use higher-bandwidth links, but using 10G
on all machines allows a single server to "flood" the network...

I suppose a maximum of 100MB/s for any individual client could be
acceptable, and if I could ensure that each client connects over a
distinct port, I could drop 2 x 4-port ethernet cards into the SAN.
However, I suspect this won't work, because either the switch or Linux
will not balance the traffic the way I need. Potentially, I could
manually set the MAC addresses on the clients and leave Linux using its
MAC-based (layer2) hash, so that the chosen MACs hash each client onto a
unique port. That still leaves the switch deciding how to send traffic
back to the SAN, and I don't know how I would control that... Perhaps it
also hashes on the source MAC address to pick the link in the trunk,
which will either work because of the MAC trick above, or not work
because of it (if the switch's calculation is different to Linux's)...
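
For what it's worth, my understanding from the kernel's bonding
documentation is that the layer2 xmit hash is roughly (source MAC XOR
destination MAC) modulo the number of slaves, effectively driven by the
last octet of each address, so in principle the client MACs could be
chosen so each lands on a different slave. A rough illustration (the
last-octet values are made up):

# Assume the SAN's MAC ends in 0x01 and the bond has 4 slaves (made up).
# Clients whose MACs end in 0x00..0x03 would each hash to a different slave:
for client in 0x00 0x01 0x02 0x03; do
    echo "client MAC ending $client -> slave $(( (client ^ 0x01) % 4 ))"
done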

I'm still at a loss as to how to correctly configure my network to solve
these issues; any hints would be appreciated.

Regards,
Adam

-- 
Adam Goryachev
Website Managers
www.websitemanagers.com.au