Re: RAID performance

The information you've provided below points straight at the root cause
of the problem.  The good news is that the fixes are simple and
inexpensive.

I must say, now that I understand the problem, I'm wondering why you
used 4 bonded GbE ports on your iSCSI target server, yet a single GbE
port on the only machine that accesses it.  Based on the information
you've presented, that asymmetry is the source of your problem.  Keep
reading.

On 2/8/2013 1:11 AM, Adam Goryachev wrote:

> OK, so potentially, I may need to get a new controller board.
> Is there a test I can run which will determine the capability of the
> chipset? I can shutdown all the VM's tonight, and run the required tests...

Forget all of this.  The problem isn't with the storage server but with
your network architecture.

> From the switch stats, ports 5 to 8 are the bonded ports on the storage
> server (iSCSI traffic):
> 
> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
> 5    734007958  0         110         120729310  0         0
> 6    733085348  0         114         54059704   0         0
> 7    734264296  0         113         45917956   0         0
> 8    732964685  0         102         95655835   0         0

I'm glad I asked you for this information.  It clearly shows that the
server is fanning its transmits round robin across the bond nearly
perfectly.  It also shows that the bulk of the traffic coming from the
W2K DC, which apparently hosts the Windows shares for the TS users, is
being pumped to the storage server over port 5, the first port in the
switch's bonding group.  The switch is distributing frames to the bond
adaptively, by its own load balancing algorithm, instead of round
robin.  This is the default behavior of many switches and is fine.

> So, traffic seems reasonably well balanced across all four links

The storage server's transmit traffic is well balanced across its four
NICs, but the receive traffic from the switch is imbalanced, almost 3:1
between ports 5 and 7.  This is the switch's adaptive load balancing at
work, and it helps us diagnose the problem.
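
If you want to double check what the bond is doing from the storage
server's side, the kernel exposes it directly (assuming the bond is
named bond0; adjust names to your setup):

    cat /proc/net/bonding/bond0   # reports the bonding mode and slaves
    cat /proc/net/dev             # per-NIC TX/RX counters, to compare
                                  # against the switch's per-port stats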

> The win2k DC is on physical machine 1 which is on port 9 of the switch,
> I've included the above stats here as well:
> 
> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
> 5    734007958  0         110         120729310  0         0
> 6    733085348  0         114         54059704   0         0
> 7    734264296  0         113         45917956   0         0
> 8    732964685  0         102         95655835   0         0

> 9    1808508983 0         72998       1942345594 0         0

And here the problem is brightly revealed.  This W2K DC box on port 9,
hosting the shares for the terminal services users, appears to be
funneling all of your file IO to/from the storage server via iSCSI, and
to/from the terminal servers via CIFS, all over a single GbE interface.
Normally this wouldn't be a big problem.  But you have users copying
50GB files over the network, to terminal server machines no less.

As seen from the switch metrics, when a user does a large file copy
from a share on one iSCSI target to a share on another iSCSI target,
here is what happens:

1.  The W2K DC share server pulls the filesystem blocks over iSCSI.
2.  The storage server pushes the packets out round robin at 4x the
    rate the DC can accept them, saturating the DC's receive port.
3.  The switch issues 802.3x pause frames to the storage server's NICs
    for the entire length of the copy operation due to the 4:1
    imbalance.  The server is so overpowered, with SSD and 4x GbE
    links, that this doesn't bog it down, but it does give us valuable
    information about the problem.
4.  The DC, upon receiving the filesystem blocks, immediately transmits
    them back to the other iSCSI target on the storage server.
5.  Now the DC's transmit interface is saturated as well.
6.  So both the Tx and Rx sides of the DC's NIC are saturated.
7.  Now all CIFS traffic on all terminal servers is significantly
    delayed due to congestion at the DC, causing severe lag for anyone
    doing file operations to/from the DC shares.
8.  If the TS/roaming profiles live on a share on this DC server, any
    operation touching a profile will be slow, especially logon/logoff,
    as your users surely have massive profiles, given that they save
    multi-GB files to their desktops.
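
To put rough numbers on that 4:1 imbalance (assuming ~94% payload
efficiency on GbE with standard frames):

    4 x 1 Gb/s bond = ~470 MB/s the storage server can source
    1 x 1 Gb/s port = ~117 MB/s the DC can sink, while simultaneously
                      sourcing another ~117 MB/s for step 4

Roughly three quarters of the offered frames have nowhere to go, so
the switch has no choice but to buffer, then pause the senders.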

> 802.3x Pause Frames Transmitted		1230476

"Bingo" metric.

> 2) The value for Pause Frames Transmitted, I'm not sure what this is,
> but it doesn't sound like a good thing....
> http://en.wikipedia.org/wiki/Ethernet_flow_control
> Seems to indicate that the switch is telling the physical machine to
> slow down sending data, and if these happen at even time intervals, then
> that is an average of one per second for the past 16 days.....

The average is irrelevant.  The switch only sends pauses to the storage
server NICs when they're transmitting more frames/sec than the single
port to which the DC is attached can forward.  More precisely, a pause
is issued every time the buffer on switch port 9 is full when ports 5-8
attempt to forward a frame, and that buffer fills because the
downstream GbE NIC can't swallow frames fast enough.  You've got 1.2
million of these pause frames logged.  This is your beacon in the dark,
shining a bright light on the problem.
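
You can see the same story from the Linux end with stock ethtool
queries (eth0 is a placeholder name; the counter names vary by driver,
and only appear if the driver exports them):

    ethtool -a eth0                  # negotiated RX/TX pause settings
    ethtool -S eth0 | grep -iE 'pause|flow|xoff'   # pause counters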

> I can understand that the storage server can send faster that any
> individual receiver, so I can see why the switch might tell it to slow
> down, but I don't see why the switch would tell the physical machine to
> slow down.

It's not telling the "physical machine" to "slow down".  It's telling
the ethernet device to pause briefly between transmissions because the
switch port behind which the target MAC address sits is under load
distress.  Your storage server isn't slowing down your terminal servers
or the user apps running on them.  Your DC is.

> So, to summarise, I think I need to look into the network performance,

You just did, and helped put the final nail in the coffin.  You simply
didn't realize it.  And you may balk at the solution, as it is so
simple and cheap.  The problem and the solution are:

Problem:
The W2K DC handles all of the client CIFS file IO with the terminal
servers, as well as all iSCSI IO to/from the storage server, over a
single GbE interface.  It has a 4:1 ethernet bandwidth deficit against
the storage server alone, causing massive network congestion at the DC
during large file transfers.  This in turn bogs down CIFS traffic
across all the TS boxen, lagging the users.

Solution:
Simply replace the onboard single port GbE NIC in the W2K DC share
server with an Intel quad port GbE NIC and configure bonding with the
switch, using ALB instead of RR.  ALB will prevent the DC share server
from overwhelming the terminal servers in the same manner the storage
server is currently overwhelming the DC.  Leave the storage server as
RR.
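
For reference, on a Linux box the mode is just a bonding option.  Here
is a minimal Debian-style sketch of the storage server's current setup,
with placeholder interface names and addresses (the W2K DC itself would
instead be teamed in Intel's PROSet software, using its ALB team type):

    # /etc/network/interfaces -- storage server stays round robin
    auto bond0
    iface bond0 inet static
        address 192.168.0.10      # placeholder address
        netmask 255.255.255.0
        bond-mode balance-rr      # balance-alb would give ALB behavior
        bond-miimon 100
        bond-slaves eth0 eth1 eth2 eth3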

However, this doesn't solve the problem of one user on a terminal
server bogging down everyone else on the same TS box when he or she
pulls a 50GB file to the desktop.  But the degradation will now be
limited to users on that one TS box.  If you want to mitigate this to
a degree, use two bonded NIC ports in the TS boxen.  Here you can use
RR transmit without problems, as 2 ports can't saturate the 4 on the
DC's new 4 port NIC.  A 50GB transfer will take 4-5 minutes instead of
the current 8-10, as the quick arithmetic below shows.
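
Those estimates are straight bandwidth arithmetic, assuming roughly
110 MB/s of real throughput per GbE link after protocol overhead:

    50 GB / ~110 MB/s (one link)    = ~7.5 minutes
    50 GB / ~220 MB/s (two bonded)  = ~3.8 minutes

plus some overhead at each end, which lands you in the 8-10 and 4-5
minute ranges.
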
But my $deity, why are people moving 50GB files across a small biz
network for Pete's sake...  If this is an ongoing activity, you need to
look into per-user IO limiting on Windows so you can prevent one person
from hogging all the IO bandwidth.  I've never run into this before, so
you'll have to research it.  There may be a policy for it if you're
lucky.  I've always handled this kind of thing with a cluestick.  On to
the solution, or at least most of it.

http://www.intel.com/content/dam/doc/product-brief/ethernet-i340-server-adapter-brief.pdf

You want the I340-T4, 4 port copper, obviously.  It runs about $250
USD, roughly $50 less than the I350-T4, and it's the best 4 port copper
GbE NIC for the money, with all the features you need.  You're already
using 2x I350-T2s in the server, so this card will be familiar WRT
driver configuration, etc.

Crap, I just remembered you're using consumer Asus boards for the
other machines.  I just checked the manual for the Asus M5A88-M, and
it's not clear whether anything but a graphics card can be used in the
x16 slot...

So, I'd acquire one 4 port PCIe x4 Intel card, and two of these Intel
single port x1 cards (Intel doesn't offer a 2 port x1 card):
http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/pro-1000-pt-server-adapter-brief.pdf

If the 4 port x4 card won't work, use the two single port x1 cards
bonded with ALB.  In that case you'll also want to switch the NICs on
the iSCSI server to ALB, or you'll still have switch congestion.  The
4 port 400MB/s solution would be optimal, but 200MB/s is still double
what you have now; it will alleviate the problem, though it won't
eliminate it.  I hope the 4 port PCIe x4 card will work in that board.
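
If you do end up moving the iSCSI server's bond to ALB, on most distros
that's a one-line module option followed by a reload of the bonding
driver (the file name and option syntax may vary by distro):

    # /etc/modprobe.d/bonding.conf
    options bonding mode=balance-alb miimon=100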

If you must use the PCIe x1 single port cards, you could try adding a
PRO/1000 PCI NIC and Frankensteining these 3 together with the onboard
Realtek 8111 to get 4 ports.  That's uncharted territory for me.  I
always use matching NICs, or at least NICs from the same hardware
family using the same driver.
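
Before mixing cards like that, it's at least worth confirming which
driver each port would use.  Something like this, with example
interface names:

    for i in eth0 eth1 eth2 eth3; do
        echo -n "$i: "; ethtool -i $i | grep ^driver
    done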

I hope I've provided helpful information.

Keep us posted.

-- 
Stan
