On 2/11/2013 11:33 PM, Adam Goryachev wrote:

> I'm assuming MPIO is Multi Path IO (ie, MultiPath iSCSI)?

Yes. Shorter to type. ;)

> I assume this means that if you have a quad port card in each machine,
> with a single ethernet connected to each of 4 switches, then you can do
> balance-rr because bandwidth on both endpoints is equal ?

It's not simply that bandwidth is equal, but that the ports are symmetrical. Every source port on a host has exactly one path to exactly one destination port on the other host. It's identical to using crossover cables between hosts, but the multiple independent switches allow more hosts to participate than crossover cables would.

> As mentioned, this cuts off the iSCSI from the rest of the 6 xen boxes.

Palm, meet forehead. I forgot you were using iSCSI for anything other than live migrating the DC VM amongst the Xen hosts.

> I'm assuming MPIO requires the following:
> SAN must have multiple physical links over 'disconnected' networks (ie,
> different networks) on different subnets.
> iSCSI client must meet the same requirements.

I fubar'd this. See below for a thorough explanation. The IPs should all be in the same subnet.

> OK, what about this option:
>
> Install dual port ethernet card into each of the 8 xen boxes
> Install 2 x quad port ethernet card into each of the san boxes
>
> Connect one port from each of the xen boxes plus 4 ports from each san
> box to a single switch (16 ports)
>
> Connect the second port from each of the xen boxes plus 4 ports from
> each san box to a second switch (16 ports)
>
> Connect the motherboard port (existing) from each of the xen boxes plus
> one port from each of the SAN boxes (management port) to a single switch
> (10 ports).
>
> Total of 42 ports.
>
> Leave the existing motherboard port configured with existing IP's/etc,
> and dedicate this as the management/user network (RDP/SMB/etc).

Keeping the LAN and SAN traffic on separate segments is a big plus. But I still wonder whether a single link for SMB traffic is enough for that greedy bloke moving 50GB files over the network.

> We then configure the SAN boxes with two bond devices, each consisting
> of a set of 4 x 1Gbps as balance-alb, with one IP address each (from 2
> new subnets).

Use MPIO (multipath) only. Do not use channel bonding. MPIO runs circles around channel bonding. Read this carefully; I'm pretty sure you'll like it.

OK, so this fibre channel guy has been brushing up a bit on iSCSI multipath, and it turns out the IP subnetting is a non-issue; after thinking it through I've put palm to forehead. As long as you have an IP path between ethernet ports, the network driver uses the MAC address from that point forward. Duh! Remember that IP addresses exist solely for routing packets from one network to another. Within a single network the hardware address is used, i.e. the MAC address. This has been true for the 30+ years of networking. Palm to forehead again. ;)

So, pick a unique subnet for SAN traffic, and assign an IP and appropriate mask to each physical iSCSI port in all the machines. The rest of the iSCSI setup you already know how to do. The only advice I can give you here is to expose every server target LUN out every physical port, so the Xen box ports see the LUNs on every server port. I assume you've already done this with the current IP subnet, as it's required to live migrate your VMs amongst the Xen servers. So you just need to change it over to the new SAN-specific subnet/ports.
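To put rough shape on that, here's a minimal addressing sketch. The subnet (192.168.100.0/24) and interface names (eth2/eth3 on a Xen box, eth4-eth7 for the server's quad port card) are made up for illustration; substitute whatever your hardware actually enumerates as:

  # On each Xen box: one IP per physical iSCSI port, all in the one SAN subnet
  ip addr add 192.168.100.11/24 dev eth2
  ip addr add 192.168.100.12/24 dev eth3
  ip link set eth2 up
  ip link set eth3 up

  # On the primary iSCSI server: one IP per port of the quad port card
  # (plus 'ip link set ethX up' for each of the four ports)
  ip addr add 192.168.100.1/24 dev eth4
  ip addr add 192.168.100.2/24 dev eth5
  ip addr add 192.168.100.3/24 dev eth6
  ip addr add 192.168.100.4/24 dev eth7

  # With several ports in one subnet you may also want to tame ARP flux:
  sysctl -w net.ipv4.conf.all.arp_ignore=1
  sysctl -w net.ipv4.conf.all.arp_announce=2

Make the addresses permanent in your distro's network config once you're happy with them.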
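Then on each Xen box the open-iscsi side is just a discovery against each server portal and a login to everything found. The portal IPs follow the made-up addressing above, so treat this as a sketch rather than copy-and-paste:

  # Discover the targets through every server portal
  for portal in 192.168.100.1 192.168.100.2 192.168.100.3 192.168.100.4; do
      iscsiadm -m discovery -t sendtargets -p $portal
  done

  # Log in to every discovered portal, so each LUN is reachable via every path
  iscsiadm -m node --loginall=all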
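And on the initiators, dm-multipath ties those paths back into one device per LUN. A minimal /etc/multipath.conf sketch (the blacklist entry assumes the local boot disk is sda; adjust, or blacklist by WWID, as appropriate):

  # /etc/multipath.conf -- initiators (Xen boxes) only, never the targets
  blacklist {
      # keep multipath's hands off the local boot disk
      devnode "^sda$"
  }

  defaults {
      user_friendly_names yes
      # one priority group containing all paths = round robin across all of them
      path_grouping_policy multibus
  }

After restarting multipathd, 'multipath -ll' should show each LUN once, with all eight server ports listed as paths beneath it.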
Now, when you run 'multipath -ll' on each Xen box it'll see all the LUNs on all 8 ports of each iSCSI server (as well as local disks), and automatically do round robin fanning of SCSI block IO packets across all 8 server ports. You may need to blacklist local devices. You'll obviously want to keep the LUNs on the standby iSCSI server masked, or simply not used until needed. You only install the multipath driver on the initiators (Xen clients), --NOT ON THE TARGETS (servers)--. All block IO transactions are initiated by the client (think of a desktop PC with a single SATA drive--who talks first, mobo or drive?). The iSCSI server will always reply on the port a packet arrived on. So you get automatic, perfectly balanced block IO scaling on all server ports, all the time, no matter how many clients are talking. Told ya you'd like this. ;)

Here's a relatively informative read on multipath:
http://linfrastructure.blogspot.com/2008/02/multipath-and-equallogic-iscsi.html

> Add a "floating" IP to the current primary SAN on each of the bond
> interfaces from the new subnets.

No, see above.

> We configure each of the xen boxes with two new ethernets with one IP
> address each (from the 2 new subnets).
>
> Configure multipath to talk to the two floating IP's

See above.

> See a rough sketch at:
> http://suspended.wesolveit.com.au/graphs/diagram.JPG
> I couldn't fit any detail like IP addresses without making it a complete
> mess. BTW, sw1 and sw2 I'm thinking can be the same physical switch,
> using VLAN to make them separate (although different physical switches
> adds to the reliability factor, so that is also something to think about).
>
> Now, this provides up to 2Gbps traffic for any one host, and up to 8Gbps
> traffic in total for the SAN server, which is equivalent to 4 clients at
> full speed.

With multipath, this architecture configured as above is going to be pretty speedy.

> It also allows for the user network to operate at a full 1Gbps for
> SMB/RDP/etc, and I could still prioritise RDP at the switch....

Prioritizing RDP is a necessity for responsiveness. But unloading the SAN traffic from that single interface makes a huge difference, as you've already seen.

> I'm thinking 200MB/s should be enough performance for any one machine
> disk access, and 1Gbps for any single user side network access should be
> ample given this is the same as what they had previously.

Coincidentally, the last 'decent' size network I managed had 525-ish users, but our 4 Citrix servers were bare metal blades. All our CIFS traffic hit a single blade's GbE port. That blade, running ESX3, hosted our DC file server VM and 6 other Linux and Windows VMs, some of which had significant traffic. The user traffic HBA was a single GbE port, and the SAN HBA was 2Gb/s fibre channel--the same bandwidth as your soon-to-be setup, though the backend was different: one FAStT600 and one SataBlade, each with a single 2Gb FC link.

> The only question left is what will happen when there is only one xen
> box asking to read data from the SAN? Will the SAN attempt to send the
> data at 8Gbps, flooding the 2Gbps that the client can handle, and

You're not using balance-rr inappropriately here, so, no, this isn't an issue. In a request/reply chain, responses only go out at the same rate requests come in. With two client ports making requests to 8 server ports, each of the 8 will receive and reply to 1/8th of the total requests, generating ~1/8th of the total bandwidth. 200/8 = 25, so each of the 8 server ports will transmit ~25MB/s in replies.
Reply packets will be larger due to the data payload, but the packet queuing, window scaling, etc. in the receiver's TCP stack FOR EACH PORT will slow down the sender when necessary.

And now you're wondering why TCP packet queuing didn't kick in with balance-rr, causing all of those ethernet pause frames and other issues. Answer: I think the problem was that the TCP back-off features were short-circuited. When you were using balance-rr, packets were likely arriving wildly out of sequence, from the same session, but from different MAC addresses. You were sending from one IP stack out 4 MACs to one IP stack on one MAC. With multipath iSCSI, each MAC has its own IP address and its own TCP stack, so all packets always arrive in order, or are easily reordered, allowing packet queuing, window scaling, etc. to work properly. Balance-rr works in the cluster scenario up top because the packets still arrive in sequence, even though on different ports from different MACs--say 1/2/3/4, 5/6/7/8, etc. Previously you probably had packet ordering something like 4/1/3/2/7/8/6/5 on occasion. This short-circuited the receiving TCP stack, preventing it from sending back-offs. The TCP stack on the server thought all was fine and kept slinging packets until the switch started sending back ethernet pause frames. I'm not enough of an ethernet or TCP expert to explain what happens next, but I'd bet those Windows write errors are related to this.

Again, ya don't have to worry about any of this mess using multipath. And you get fully balanced bandwidth across all Xen hosts and all server ports, all the time. Pretty slick.

> generate all the pause messages, or is this not relevant and it will
> "just work". Actually, I think from reading the docs, it will only use
> one link out of each group of 4 to send the data, hence it won't attempt
> to send at more than 2Gbps to each client....

See above.

> I don't think this system will scale any further than this, I can only
> add additional single Gbps ports to the xen hosts, and I can only add
> one extra 4 x 1Gbps ports to each SAN server.... Best case is add 4 x
> 10Gbps to the SAN, 2 single 1Gbps ports to each xen, providing a full
> 32Gbps to the clients, each client gets max 4Gbps. In any case, I think
> that would be one kick-ass network, besides being a pain to try and
> debug, keep cabling neat and tidy, etc... Oh, and the current SSD's
> wouldn't be that fast... At 400MB/s read, times 7 data disks is
> 2800MB/s, actually, damn, that's fast.

You could easily get by with a single quad port NIC on iSCSI target duty in each server. That's 800MB/s duplex, the same as 4Gb fibre channel. That's more than sufficient for your 8 Xen nodes, especially given the bulk of your traffic is SMB, which is limited to ~100MB/s, which means ~100MB/s on the SAN, with ~300MB/s of breathing room. Use the other quad port NIC direct connected with crossover cables to the other server for DRBD, using balance-rr (a quick sketch of that bond is below).

> The only additional future upgrade I would plan is to upgrade the
> secondary san to use SSD's matching the primary. Or add additional SSD's
> to expand storage capacity and I guess speed. I may also need to add
> additional ethernet ports to both SAN1 and SAN2 to increase the DRBD
> cross connects, but these would I assume be configured using linux
> bonding in balance-rr since there is no switch in between.

See above.

> I'd rather keep all boxes with identical hardware, so that any VM can be
> run on any xen host.

Looks like you've got the right architecture for it nailed down now.
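On the DRBD cross-connect, here's a rough balance-rr bond sketch in Debian-style /etc/network/interfaces syntax (ifenslave); adjust for whatever your distro uses. The interface names (eth8-eth11 for the second quad port card) and the 10.0.0.0/30 point-to-point subnet are made up; the second server mirrors this with .2, and there's no switch involved, just the four crossover cables:

  auto bond0
  iface bond0 inet static
      address 10.0.0.1
      netmask 255.255.255.252
      bond-mode balance-rr
      bond-miimon 100
      bond-slaves eth8 eth9 eth10 eth11

Point the DRBD resource at the bond0 addresses and the replication traffic gets the full width of all four links, which is exactly the symmetric point-to-point case where balance-rr behaves itself.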
> So, the current purchase list, which the customer approved yesterday,
> and most of it should be delivered tomorrow (insufficient stock, already
> ordering from 4 different wholesalers):
> 4 x Quad port 1Gbps cards
> 4 x Dual port 1Gbps cards
> 2 x LSI HBA's (the suggested model)
> 1 x 48port 1Gbps switch (same as the current 16port, but more ports).

And more than sufficient hardware. I was under the impression that this much capital was not available, or I'd have made different recommendations--one of them very similar to what you came up with here.

> The idea being to pull out 4 x dual port cards from san1/2 and install
> the 4 x quad port cards. Then install a single dual port card on each
> xen box. Install one LSI HBA in each san box. Use the 48 port switch to
> connect it all together.
> However, I'm going to be short 1 x quad ethernet, and 1 x sata
> controller, so the secondary san is going to be even more lacking for up
> to 2 weeks when these parts arrive, but IMHO, that is not important at
> this stage, if san1 falls over, I'm going to be screwed anyway running
> on spinning disks :) though not as screwed as being plain
> down/offline/nothing/just go home folks...

Two words: Murphy's law. ;)

> No problem, this has been a definite learning experience for me and I
> appreciate all the time and effort you've put into assisting.

There are millions of folks spewing vitriol at one another at any moment on the net. I prefer to be constructive: help people out when I can, pass along a little knowledge and creativity when possible, and learn things myself. That's not to say I don't pop at folks now and then when frustration boils over. ;) I'm human too.

> BTW, I went last night (monday night) and removed one dual port card
> from the san2, installed into the xen host running the DC VM. Configured
> the two new ports on the xen box as active-backup (couldn't get LACP to
> work since the switch only supports max of 4 LAG's anyway). Removed one
> port from the LAG on san1, and setup the three ports (1 x san + 2 x
> xen1) as a VLAN with private IP address on a new subnet. Today,
> complaints have been non-existant, mostly relating to issues they had
> yesterday but didn't bother to call until today. It's now 4:30pm, so I'm
> thinking that the problem is solved just with that done.

So the biggest part of the problem was simply SMB and iSCSI sharing the same link on the DC. Let's see how the new system handles that user in need of a clue stick doing his 50GB SMB transfer while all the other users are humming away.

> I was going to
> do this across all 8 boxes, using 2 x ethernet on each xen box plus one
> x ethernet on each san, producing a max of 1Gbps ethernet for each xen
> box. However, I think your suggestion of MPIO is much better, and
> grouping the SAN ports into two bundles makes a lot more sense, and
> produces 2Gbps per xen box.

Nope, no bundles. As with balance-rr, MPIO is awesome when deployed properly.

> Thanks again, I appreciate all the help.

I appreciate the fact you posted the topic. I had to go (re)learn a little bit myself. I just hope you read this email before trying to stick MPIO on top of a channel bond. ;)

Send me a picture of the racked gear when it's all done, front and back, so I can see how ugly that Medusa is, and remind myself of one of many reasons I prefer fibre channel. ;)

--
Stan