Re: RAID performance

Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:

>On 2/11/2013 11:33 PM, Adam Goryachev wrote:
>> I'm assuming MPIO requires the following:
>> SAN must have multiple physical links over 'disconnected' networks
>>(ie, different networks) on different subnets.
>> iSCSI client must meet the same requirements.
>
>I fubar'd this.  See below for a thorough explanation.  The IPs should
>all be in the same subnet.
>
>> OK, what about this option:
>> 
>> Install dual port ethernet card into each of the 8 xen boxes
>> Install 2 x quad port ethernet card into each of the san boxes
>> 
>> Connect one port from each of the xen boxes plus 4 ports from each
>> san box to a single switch (16ports)
>> 
>> Connect the second port from each of the xen boxes plus 4 ports from
>> each san box to a second switch (16 ports)
>> 
>> Connect the motherboard port (existing) from each of the xen boxes
>> plus one port from each of the SAN boxes (management port) to a single
>> switch (10 ports).
>> 
>> Total of 42 ports.
>> 
>> Leave the existing motherboard port configured with existing
>IP's/etc,
>> and dedicate this as the management/user network (RDP/SMB/etc).
>
>Keeping the LAN and SAN traffic on different segments is a big plus.
>But I still wonder if a single link for SMB traffic is enough for that
>greedy bloke moving 50GB files over the network.
>
>> We then configure the SAN boxes with two bond devices, each
>> consisting of a set of 4 x 1Gbps as balance-alb, with one IP
>> address each (from 2 new subnets).
>
>Use MPIO (multipath) only.  Do not use channel bonding.  MPIO runs
>circles around channel bonding.  Read this carefully.  I'm pretty sure
>you'll like this.
>
>Ok, so this fibre channel guy has been brushing up a bit on iSCSI
>multipath, and it looks like the IP subnetting is a non issue, and
>after
>thinking it through I've put palm to forehead.  As long as you have an
>IP path between ethernet ports, the network driver uses the MAC address
>from that point forward, DUH!.  Remember that IP addresses exist solely
>for routing packets from one network to another.  But within a network
>the hardware address is used, i.e. the MAC address.  This has been true
>for the 30+ years of networking.  Palm to forehead again. ;)
>
>So, pick a unique subnet for SAN traffic, and assign an IP and
>appropriate mask to each physical port in all the machines' iSCSI
>ports.

There are a couple of problems I'm having with this solution.

I've created 8 IPs (one on each eth interface on san1, all in the same /24 subnet) and another 2 IPs on xen1 for its 2 eth interfaces, again in the same subnet.

The initial reason I knew this wouldn't work is that Linux will see the ARP request (broadcast) and respond from any interface. For example, from xen1 I ping each IP on san1, then run arp -an and see the same MAC address for every IP (or sometimes two or even three different MAC addresses, but mostly the same; I run arp -d in between to restart the test clean each time).

So, after some initial reading (since I remembered from years ago this was solvable) I did:
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

Now I get a unique MAC for each IP (perfect).
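
To make those settings survive a reboot I'll presumably just drop them into /etc/sysctl.conf; a minimal sketch, assuming the usual Debian-style location of that file:

# /etc/sysctl.conf additions - persist the ARP tweaks set above
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

and then reload without rebooting using sysctl -p.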

What about balancing the reverse traffic?
Assuming I set the same options on xen1, san1 will reply to the different IPs on xen1's different ethernet ports.
Except that when xen1 sends any TCP request to any of san1's IPs, it will always come from the same source IP, so san1 will always respond to that same IP. This is why I suggested using two distinct subnets/LANs/switches.

This means I'm limited to only 1Gbps outbound, and therefore only 1Gbps inbound, since san1 will always reply to the same IP...
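
For reference, you can see the kernel's choice of source directly; the addresses below are just made-up examples from my test subnet:

ip route get 192.168.30.11
# prints something like:
#   192.168.30.11 dev eth1 src 192.168.30.101
# and it's the same dev/src for every san1 address in the subnet, which
# is exactly the single-source limit described above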

If I use two subnets, multipath will use the first ethernet interface/IP to talk to san1 on its first IP, and the second ethernet interface/IP to talk to san1 on its second subnet/IP.

My thought was to use bonding with balance-alb for the 4 ports on san1 so that it uses at most 1 of the 4 ports to talk to any one client (to avoid overloading the client) and also dynamically balances all clients inbound/outbound with the fancy ARP announcing.

> The rest of the iSCSI setup you already know how to do.  The only
>advice I can give you here is to expose every server target LUN out
>every physical port so the Xen box ports see the LUNs on every server
>port.  I assume you've already done this with the current IP subnet, as
>it's required to live migrate your VMs amongst the Xen servers.  So you
>just need to change it over for the new SAN specific subnet/ports. 
>Now,
>when you run 'multipath -ll' on each Xen box it'll see all the LUNs on
>all 8 ports of each iSCSI server (as well as local disks), and
>automatically do round robin fanning of SCSI block IO packets across
>all
>8 server ports.  You may need to blacklist local devices.  You'll
>obviously want to keep the LUNs on the standby iSCSI server masked, or
>simply not used until needed.
>
>You only install the multipath driver on the initiators (Xen clients),
>--NOT ON THE TARGETS (servers)-- .  All block IO transactions are
>initiated by the client (think desktop PC with single SATA drive--who
>talks first, mobo or drive?).  The iSCSI server will always reply on
>the
>port a packet arrived on.  So, you get automatic perfect block IO
>scaling on all server ports, all the time, no matter how many clients
>are talking.  Told ya you'd like this. ;)  Here's a relatively
>informative read on multipath:
>
>http://linfrastructure.blogspot.com/2008/02/multipath-and-equallogic-iscsi.html

OK, so I need to use iscsiadm and bind each session to an individual MAC/interface... now I see how this will work.
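
For the archives, this is roughly what I intend to run on xen1; the iface names and the portal IP are my own placeholders, so treat it as a sketch of the open-iscsi interface binding rather than the exact commands:

# create one iface record per physical port and tie it to the NIC
iscsiadm -m iface -I iface-eth1 --op=new
iscsiadm -m iface -I iface-eth1 --op=update -n iface.net_ifacename -v eth1
iscsiadm -m iface -I iface-eth2 --op=new
iscsiadm -m iface -I iface-eth2 --op=update -n iface.net_ifacename -v eth2

# discover the targets through both bound interfaces, then log in to all
iscsiadm -m discovery -t sendtargets -p 192.168.30.11 -I iface-eth1 -I iface-eth2
iscsiadm -m node -L all

# finally check that each LUN shows one active path per bound interface
multipath -ll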

>> Add a "floating" IP to the current primary SAN on each of the bond
>> interfaces from the new subnets.
>No, see above.

>> We configure each of the xen boxes with two new ethernets with one IP
>> address each (from the 2 new subnets).
>> 
>> Configure multipath to talk to the two floating IP's

So, to do the failover, I just need to stop ietd on san1 and start ietd on san2... as long as each client did a discovery at some point while san2 was running, so that it knows that is a possible path to the devices...
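
In other words something like this for the manual failover (the ietd init script name differs between distros, iscsitarget on Debian I believe, so this is just a sketch):

# on san1: stop exporting the targets
/etc/init.d/iscsitarget stop

# on san2: start exporting the same targets/LUNs
/etc/init.d/iscsitarget start

# on each xen box, multipathd should mark the san1 paths as failed and
# carry on over the san2 paths discovered earlier; check with:
multipath -ll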

>> I'm thinking 200MB/s should be enough performance for any one machine
>> disk access, and 1Gbps for any single user side network access should
>be
>> ample given this is the same as what they had previously.
>
>Coincidentally, the last 'decent' size network I managed had 525'ish
>users, but our 4 Citrix servers were bare metal blades.  All our CIFS
>traffic hit a single blade's GbE port.  That blade, ESX3, hosted our DC
>file server VM and 6 other Linux and Windows VMs, some of which had
>significant traffic.  User traffic HBA was single GbE, and the SAN HBA
>was 2Gb/s fibre channel.  Same bandwidth as your soon-to-be setup.
>Though the backend was different, one FasTt600 and one SataBlade, each
>with a single 2Gb FC link.

I'll bet it was a lot neater too :)

>> The only question left is what will happen when there is only one xen
>> box asking to read data from the SAN? Will the SAN attempt to send
>the
>> data at 8Gbps, flooding the 2Gbps that the client can handle, and
>
>You're not using balance-rr inappropriately here.  So, no, this isn't
>an
>issue.  In a request/reply chain, responses will only go out at the
>same
>rate requests come in.  With two ports making requests to 8 ports, each
>of the 8 will receive and reply to 1/4th of the total requests,
>generating ~1/4th of the total bandwidth.  200/8=25, so each of the 8
>ports will transmit ~25MB/s in replies.  Reply packets will be larger
>due to the data payload, but the packet queuing, window scaling, etc in
>the receiver TCP stack FOR EACH PORT will slow down the sender when
>necessary.
>
>And you're now wondering why TCP packet queuing didn't kick in with
>balance-rr, causing all of those ethernet pause frames and other
>issues.
> Answer:  I think the problem was that the TCP back off features were
>short circuited.  When you were using balance-rr, packets were likely
>arriving wildly out of sequence, from the same session, but from
>different MAC addresses.  You were sending from one IP stack out 4 MACs
>to one 1 IP stack on one MAC.
>
>With multipath iSCSI, each MAC has its own IP address and own TCP
>stack,
>so all packets always arrive in order or are easily reordered, allowing
>packet queuing, window scaling, etc, to work properly.  Balance-rr
>works
>in the cluster scenario up top because the packets still arrive in
>sequence, even though on different ports from different MACs.  Say
>1/2/3/4, 5/6/7/8, etc.  Previously you probably had packet ordering
>something like 4/1/3/2/7/8/6/5 on occasion.  This short circuited the
>receiving TCP stack preventing it from sending back offs.  The TCP
>stack
>on the server thought all was fine and kept slinging packets until the
>switch started sending back ethernet pause frames.  I'm not enough of
>an
>ethernet or TCP expert to explain what happens next, but I'd bet those
>Windows write errors are related to this.
>
>Again, ya don't have to worry about any of this mess using multipath.
>And, you get full port balanced bandwidth on all Xen hosts, and all
>server ports, all the time.  Pretty slick.
>
>> generate all the pause messages, or is this not relevant and it will
>> "just work". Actually, I think from reading the docs, it will only
>use
>> one link out of each group of 4 to send the data, hence it won't
>attempt
>> to send at more than 2Gbps to each client....
>
>See above.

I'm not confident, but I'll give it a go and see... since we will send one small request to each of the 8 san1 IPs, and each of those can reply at 1Gbps, and the reply will be bigger than the request (for reads). Though I suppose we won't submit the next read request until after we get the first reply, so perhaps this will keep things under control... I'll let you know how it goes.
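
To check it, I'll probably just watch the per-port counters on san1 while one xen box does a big sequential read; something like this (sar is from the sysstat package):

# 1-second samples of rx/tx throughput per NIC; with multipath doing
# round robin across the 8 ports the traffic should spread fairly evenly
sar -n DEV 1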

>> I don't think this system will scale any further than this, I can
>only
>> add additional single Gbps ports to the xen hosts, and I can only add
>> one extra 4 x 1Gbps ports to each SAN server.... Best case is add 4 x
>> 10Gbps to the SAN, 2 single 1Gbps ports to each xen, providing a full
>> 32Gbps to the clients, each client gets max 4Gbps. In any case, I
>think
>> that would be one kick-ass network, besides being a pain to try and
>> debug, keep cabling neat and tidy, etc... Oh, and the current SSD's
>> wouldn't be that fast... At 400MB/s read, times 7 data disks is
>> 2800GB/s, actually, damn, that's fast.
>
>You could easily get by with a single quad port NIC on iSCSI target
>duty
>in each server.  That's 800MB/s duplex, same as 4Gb fibre channel.
>That's more than sufficient for your 8 Xen nodes, especially given the
>bulk of your traffic is SMB, which is limited to ~100MB/s, which means
>~100MB/s on the SAN, with ~300MB/s breathing room.  Use the other quad
>port NIC direct connected with x-over cables to the other server for
>DRBD using balance-rr.

Yes, at some point I'm going to need to increase the connection for DRBD from the current 1Gbps, but one thing at a time :)
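
When I do get there, based on your suggestion above it would presumably look something like this (a rough sketch only, Debian-style /etc/network/interfaces with the ifenslave package; interface names and addresses are made up):

auto bond1
iface bond1 inet static
    address 10.254.0.1
    netmask 255.255.255.0
    bond-slaves eth8 eth9 eth10 eth11
    bond-mode balance-rr
    bond-miimon 100
# balance-rr should be fine here because the links are direct x-over
# cables with no switch in between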

>> The only additional future upgrade I would plan is to upgrade the
>> secondary san to use SSD's matching the primary. Or add additional
>SSD's
>> to expand storage capacity and I guess speed. I may also need to add
>> additional ethernet ports to both SAN1 and SAN2 to increase the DRBD
>> cross connects, but these would I assume be configured using linux
>> bonding in balance-rr since there is no switch in between.
>
>See above.
>And more than sufficient hardware.  I was under the impression that
>this much capital was not available, or I'd have made different
>recommendations.
>One being very similar to what you came up with here.

So did I, until they said "Just fix it, whatever you need...", so that's when I had to make sure to purchase everything I might need in one go, and make sure it would work the first time...

>Two words:  Murphy's law

Thanks. I thought I was in trouble after installing the equipment into san1: there was no keyboard. Eventually I pulled both quad port ethernets, and the keyboard was still really unreliable; pulled the new SATA controller, same thing... Eventually tried a different keyboard, and it was perfect... I can't believe a USB keyboard would fail right in the middle of a major upgrade. Thankfully there was a spare USB keyboard in its box available, or it might have taken me hours longer to sort it out!

>> No problem, this has been a definite learning experience for me and I
>> appreciate all the time and effort you've put into assisting.
>
>There are millions of folks spewing vitriol at one another at any
>moment
>on the net.  I prefer to be constructive, help people out when I can,
>pass a little knowledge and creativity when possible, learn things
>myself.  That's not to say I don't pop at folks now and then when
>frustration boils. ;)  I'm human too.

Absolutely agree with all that :)

>> BTW, I went last night (monday night) and removed one dual port card
>> from the san2, installed into the xen host running the DC VM.
>Configured
>> the two new ports on the xen box as active-backup (couldn't get LACP
>to
>> work since the switch only supports max of 4 LAG's anyway). Removed
>one
>> port from the LAG on san1, and setup the three ports (1 x san + 2 x
>> xen1) as a VLAN with private IP address on a new subnet.  Today,
>> complaints have been non-existant, mostly relating to issues they had
>> yesterday but didn't bother to call until today. It's now 4:30pm, so
>I'm
>> thinking that the problem is solved just with that done.
>
>So the biggest part of the problem was simply SMB and iSCSI on the same
>link on the DC.  Let's see how the new system does with that user in
>need of a clue stick, doing his 50GB SMB xfer when all users are
>humming away.

Well, I'm onsite now, in progress, got 3 hours to finish (by 7am) so better go and get it sorted!

>I appreciate the fact you posted the topic.  I had to go (re)learn a
>little bit myself.  I just hope you read this email before trying to
>stick MPIO on top of a channel bond. ;)

Only just in time; I had already installed some of the equipment when I got this...

>Send me a picture of the racked gear when it's all done, front, and
>back, so I can see how ugly that Medusa is, and remind myself of one of
>many reasons I prefer fibre channel. ;)

It is a real mess. I forgot to order extra cables, so I pulled random lengths of second hand cables from the spares cupboard here... there are no more spares now, but I think they are all working... Will send some pics, but I will need to come back another time with some new cables, cable ties, and some sort of cable labelling equipment to fix this up!

Thanks again, off to finish implementing now


Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

