Re: RAID performance

Normally I'd trim the post as this one is absolutely huge, but I want to
keep this thread intact for people stumbling across it via Google.  I
think it's very informative/educational to read this troubleshooting
progression and gain the insights and knowledge contained herein.

On 2/8/2013 12:44 PM, Adam Goryachev wrote:
> On 09/02/13 04:10, Stan Hoeppner wrote:
>> The information you've provided below seems to indicate the root cause
>> of the problem.  The good news is that the fix(es) are simple, and
>> inexpensive.
>>
>> I must say, now that I understand the problem, I'm wondering why you
>> used 4 bonded GbE ports on your iSCSI target server, yet employed a
>> single GbE port on the only machine that accesses it, according to the
>> information you've presented.  Based on that, this is the source of your
>> problem.  Keep reading.
> 
> Well, because the old SAN device had 4 x Gbps ports, and I copied that,
> and I also didn't want an individual PC to flood the SAN... I guess I
> never worked out that one PC was really driving 70% of the traffic....

And that was smart thinking.  You simply didn't realize that one TS
could now flood the CIFS server.

>>> From the switch stats, ports 5 to 8 are the bonded ports on the storage
>>> server (iSCSI traffic):
>>>
>>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
>>> 5    734007958  0         110         120729310  0         0
>>> 6    733085348  0         114         54059704   0         0
>>> 7    734264296  0         113         45917956   0         0
>>> 8    732964685  0         102         95655835   0         0
>>
>> I'm glad I asked you for this information.  This clearly shows that the
>> server is performing LACP round robin fanning nearly perfectly.  It also
>> shows that the bulk of the traffic coming from the W2K DC, which
>> apparently hosts the Windows shares for TS users, is being pumped to the
>> storage server over port 5, the first port in the switch's bonding
>> group.  The switch is doing adaptive load balancing with transmission
>> instead of round robin.  This is the default behavior of many switches
>> and is fine.
> 
> Is there some method to fix this on the switch? I have configured the
> switch that those 4 ports are a single LAG, which I assumed meant the
> switch would be smart enough to load balance properly... Guess I never
> checked that side of it though...

After thinking this through more thoroughly, I realize your IO server
may be doing broadcast aggregation rather than round robin.  In either
case this is bad, as it will cause out-of-order or duplicate packets,
both of which are wrong for your network architecture and will cause
problems.  RR will cause TCP segments to arrive out of sequence, adding
reassembly overhead at the receiver and possibly errors if they can't
be put back in the correct order.  Broadcast will cause duplicate
packets to arrive at the receiver, which must discard them.  Both flood
the receiver's switch port.

The NIC ports on the IO server need to be configured as 802.3ad Dynamic
if you're using the Linux bonding driver.  If you're using the Intel
teaming driver instead, it should be set to LACP as well, though the
name may be different.
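
With the Linux bonding driver that would look roughly like this (a
sketch only -- the interface names, example address, and the layer3+4
hash policy are my assumptions, adjust for your setup; the same thing
can also be done via module options or sysfs):

  # Debian-style /etc/network/interfaces fragment for the 4-port bond
  auto bond0
  iface bond0 inet static
      address 192.168.0.10          # example address only
      netmask 255.255.255.0
      bond-slaves eth0 eth1 eth2 eth3
      bond-mode 802.3ad             # LACP / Dynamic link aggregation
      bond-miimon 100               # link monitoring interval in ms
      bond-lacp-rate fast
      bond-xmit-hash-policy layer3+4

The 4-port LAG on the switch side needs to be set to dynamic (LACP)
rather than static so the two ends actually negotiate.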

Round robin fanning of frames across all 4 ports evenly seems like a
good idea on paper, until you dig into the 802.3ad protocol:

http://www.ieee802.org/3/hssg/public/apr07/frazier_01_0407.pdf

Once you do, you realize (again, in my case, as it's been a while) that
any single session will by default be limited to a single physical link
of the group.  LACP only gives increased bandwidth across links when
multiple sessions are present.  This is done to preserve proper packet
ordering per session, which is corrupted when the packets of a single
session are fanned across all links.  Fanning (round robin) is only
meant to be used in multi-switch setups where each host has a NIC link
to each switch--e.g. Beowulf clusters.  In the default Dynamic mode,
the IO server can't flood the DC with more packets than it can handle,
because the two hosts will be communicating over the same link(s) and
no more, so bandwidth and packet volume are matched between them.

So, you need to disable RR or broadcast, whichever it is currently, on
the IO server, and switch it to Dynamic mode.  This will instantly kill
the flooding problem, stop the switch from sending PAUSE frames to the
IO server, and might eliminate the file/IO errors.  I'm not sure on this
last one, as I've not seen enough information about the errors (or the
actual errors themselves).  That said, disabling the Windows write
caching on the local drives backed by the iSCSI LUNs might fix this as
well.  It should never be left enabled in a configuration such as yours.
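
Once you've made the change, it's worth a quick sanity check that the
bond actually negotiated 802.3ad with the switch, and that the pause
frame counters stop climbing.  Something like this (counter names vary
by NIC driver, so treat the grep as a guess):

  cat /proc/net/bonding/bond0
      # should show "Bonding Mode: IEEE 802.3ad Dynamic link aggregation"
      # and an aggregator ID for each slave
  ethtool -a eth0
      # flow control (pause) settings on each slave
  ethtool -S eth0 | grep -i pause
      # pause frame counters, if the driver exposes them

If the mode line still says round-robin or broadcast, the switch will
keep getting hammered exactly as before.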

>>> So, traffic seems reasonably well balanced across all four links
>>
>> The storage server's transmit traffic is well balanced out of the NICs,
>> but the receive traffic from the switch is imbalanced, almost 3:1
>> between ports 5 and 7.  This is due to the switch doing ALB, and helps
>> us diagnose the problem.
> 
> The switch doesn't seem to have any setting to configure ALB or RR, or
> at least I don't know what I'm looking for.... In any case, I suppose if
> both sides of the network have equivalent bandwidth, then it should be
> OK....

Let's see, I think you listed the switch model...  yes, GS716T-200

It does standard 802.3ad static and dynamic link aggregation (dynamic
by default, it appears), so standard session-based streams.  This is
what you want.

Ah, here you go.  It does have port-based ingress/egress rate limiting,
so you should be able to slow down the terminal server hosts so that no
single one can flood the DC.  Very nice.  I wouldn't have expected this
in this class of switch.

So you can fix the network performance problem without spending any
money.  You'll just have one TS host and its users bogged down when
someone does a big file copy.  And if you can find a Windows policy to
limit IO per user, you can solve it completely.

That said, I'd still get two or four bonded ports into that DC share
server to speed things up for everyone.

>>> The win2k DC is on physical machine 1 which is on port 9 of the switch,
>>> I've included the above stats here as well:
>>>
>>> Int  PacketsRX  ErrorsRX  BroadcastRX PacketsTX  ErrorsTX  BroadcastTX
>>> 5    734007958  0         110         120729310  0         0
>>> 6    733085348  0         114         54059704   0         0
>>> 7    734264296  0         113         45917956   0         0
>>> 8    732964685  0         102         95655835   0         0
>>
>>> 9    1808508983 0         72998       1942345594 0         0
>>
>> And here the problem is brightly revealed.  This W2K DC box on port 9
>> hosting the shares for the terminal services users appears to be
>> funneling all of your file IO to/from the storage server via iSCSI, and
>> to/from the terminal servers via CIFS-- all over a single GbE interface.
>>  Normally this wouldn't be a big problem.  But you have users copying
>> 50GB files over the network, to terminal server machines no less.
>>
>> As seen from the switch metrics, when a user does a large file copy from
>> a share on one iSCSI target to a share on another iSCSI target, here is
>> what is happening:
>>
>> 1.  The W2K DC share server pulls the filesystem blocks over iSCSI
>> 2.  The storage server pushes the packets out round robin at 4x the rate
>>     that the DC can accept them, saturating its receive port
>> 3.  The switch issues back offs to the server NICs during the entire
>>     length of the copy operation due to the 4:1 imbalance.  The server
>>     is so over powered with SSD and 4x GbE links this doesn't bog it
>>     down, but it does give us valuable information as to the problem
>> 4.  The DC upon receiving the filesystem blocks immediately transmits
>>     them back to the other iSCSI target on the storage server
> 
> Another possible use case would send them off over SMB to the terminal
> server, and potentially that terminal server would send it back to the
> storage server.

Yeah, I skipped listing the CIFS client host in the traffic chain, as
once the DC is flooded all the TS servers crawl.

>> 5.  Now the DC's transmit interface is saturated
>> 6.  So now both Tx/Rx ports on the DC NIC are saturated
>> 7.  Now all CIFS traffic on all terminal servers is significantly
>>     delayed due to congestion at the DC, causing severe lag for others
>>     doing file operations to/from the DC shares.
>> 8.  If the TS/roaming profiles are on a share on this DC server
>>     any operation touching a profile will be slow, especially
>>     logon/off, as your users surely have massive profiles, given
>>     they save multi GB files to their desktops
> 
> OK, makes sense ...
> 
>>> 802.3x Pause Frames Transmitted		1230476
>> "Bingo" metric.
>>
>>> 2) The value for Pause Frames Transmitted, I'm not sure what this is,
>>> but it doesn't sound like a good thing....
>>> http://en.wikipedia.org/wiki/Ethernet_flow_control
>>> Seems to indicate that the switch is telling the physical machine to
>>> slow down sending data, and if these happen at even time intervals, then
>>> that is an average of one per second for the past 16 days.....
>>
>> The average is irrelevant.  The switch only sends pauses to the storage
>> server NICs when they're transmitting more frames/sec than the single
>> port to which the DC is attached can forward them.  More precisely,
>> pauses are issued every time the buffer on switch port 9 is full when
>> ports 5-8 attempt to forward a frame.  The buffer will be full because
>> the downstream GbE NIC can't swallow the frames fast enough.  You've got
>> 1.2 million of these pause frames logged.  This is your beacon in the
>> dark, shining bright light on the problem.
>>
>>> I can understand that the storage server can send faster that any
>>> individual receiver, so I can see why the switch might tell it to slow
>>> down, but I don't see why the switch would tell the physical machine to
>>> slow down.
>>
>> It's not telling the "physical machine" to "slow down".  It's telling
>> the ethernet device to pause between transmissions to the target MAC
>> address which is connected to the switch port that is under load
>> distress.  Your storage server isn't slowing down your terminal servers
>> or the users apps running on them.  Your DC is.
>>
>>> So, to summarise, I think I need to look into the network performance,
>>
>> You just did, and helped put the final nail in the coffin.  You simply
>> didn't realize it.  And you may balk at the solution, as it is so
>> simple, and cheap.  The problem, and the solution are:
>>
>> Problem:
>> W2K DC handles all the client CIFS file IO traffic with the terminal
>> servers, as well as all iSCSI IO to/from the storage server, over a
>> single GbE interface.  It has a 4:1 ethernet bandwidth deficit with the
>> storage server alone, causing massive network congestion at the DC
>> machine during large file transfers.  This in turn bogs down CIFS
>> traffic across all TS boxen, lagging the users.
>>
>> Solution:
>> Simply replace the onboard single port GbE NIC in the W2K DC share
>> server with an Intel quad port GbE NIC, and configure LACP bonding with
>> the switch. Use ALB instead of RR.  Using ALB will prevent the DC share
>> server from overwhelming the terminal servers in the same manner the
>> storage server is currently doing the DC.  Leave the storage server as RR.
>>
>> However, this doesn't solve the problem of one user on a terminal server
>> bogging down everyone else on the same TS box if s/he pulls a 50GB file
>> to his/her desktop.  But the degradation will now be limited to only
>> users on that one TS box.  If you want to mitigate this to a degree, use
>> two bonded NIC ports in the TS boxen.  Here you can use RR transmit
>> without problems, as 2 ports can't saturate the 4 on the DC's new 4 port
>> NIC.  A 50GB transfer will take 4-5 minutes instead of the current 8-10.
>>  But my $deity, why are people moving 50GB files across a small biz
>> network for Pete's sake...  If this is an ongoing activity, you need to
>> look into Windows user level IO limiting so you can prevent one person
>> from hogging all the IO bandwidth.  I've never run into this before so
>> you'll have to research it.  May be a policy for it if you're lucky.
>> I've always handled this kinda thing with a cluestick.  On to the
>> solution, or at least most of it.
>>
>> http://www.intel.com/content/dam/doc/product-brief/ethernet-i340-server-adapter-brief.pdf
>>
>> You want the I340-T4, 4 port copper, obviously.  Runs about $250 USD,
>> about $50 less than the I350-T4.  It's the best 4 port copper GbE NIC
>> for the money with all the features you need.  You're already using 2x
>> I350-T2s in the server so this card will be familiar WRT driver
>> configuration, etc.  It's $50 cheaper than the I350-T4 but with all the
>> needed features.
>>
>> Crap, I just remembered you're using consumer Asus boards for the other
>> machines.  I just checked the manual for the Asus M5A88-M and it's not
>> clear if anything but a graphics card can be used in the x16 slot...
>>
>> So, I'd acquire one 4 port PCIe x4 Intel card, and two of these Intel 2
>> port x1 cards (Intel doesn't offer a 2 port x1 card):
>> http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/pro-1000-pt-server-adapter-brief.pdf
>>
>> If the 4 port x4 card won't work, use the two single port x1 cards with
>> LACP ALB.  In which case you'll also want to switch the NICs on the
>> iSCSI server to ALB, or you'll still have switch congestion.  The 4 port
>> 400MB/s solution would be optimal, but 200MB/s is still double what you
>> have now, and will help alleviate the problem, but won't eliminate it.
>> I hope the 4 port PCIe x4 card will work in that board.
>>
>> If you must use the PCIe x1 single port cards, you could try adding a
>> PRO 1000 PCI NIC, and Frankenstein these 3 together with the onboard
>> Realtek 8111 to get 4 ports.  That's uncharted territory for me.  I
>> always use matching NICs, or at least all from the same hardware family
>> using the same driver.
> 
> Since I'm about to commit significant surgery on the network
> infrastructure, I might as well get this right. I did always have the
> desire to separate the iSCSI network from the SMB/user traffic network
> anyway.

Not necessary; it won't gain you anything now that you know how to
configure your current gear, or at least that it can be configured to
meet your needs, solving your current problems.

> BTW, would I probably see improved stability (ie, reduced performance,
> but less errors) by reducing the number of ethernet ports on the storage
> server to 2 ? Not a permanent solution, but potentially a very short
> term improvement while waiting for parts....

Nope, just change the bonding mode on the IO server to standard LACP
Dynamic, as I stated above, and this is all fixed.

> If I added the 4 port card to the DC machine, and a dual port card to
> each of the other machines, that means I have:
> 4 ports on SAN1
> 4 ports on SAN2
> 4 ports on DC
> 2 ports on each other box (7)
> 
> Total of 26 ports

Add the 4-port card to the DC if it'll work in the x16 slot; if not,
use two of the single port PCIe x1 NICs I mentioned and bond them in
802.3ad Dynamic mode, same as with the IO server.  Look into Windows TS
per-user IO rate limits.  If this capability exists, limit each user to
50MB/s.

And with that, you should have fixed all the network issues.  Combined
with the changes to the IO server, you should be all squared away.

> I then need to get a new switch, a 24 port switch is not enough, and 48
> ports seems overkill. Would be nice to have a spare port for "management
> access" as well. Also I guess the switch needs to support a very busy
> network...

Unneeded additional cost and complexity.

> Move the iSCSI network to a new IP range, and dedicate these network
> interfaces for iSCSI.

Unneeded additional cost and complexity.

> I could then use the existing onboard 1Gbps ethernet on the machines for
> the user level connectivity/SMB/RDP/etc, on the existing switch/etc.
> Also, I can use the existing onboard 1G ports on the storage server for
> talking to the user level network/management/etc.
> That would free up 8 ports on the existing switch (removing the 2 x 4
> ports on SAN1/2).

Unneeded additional cost and complexity.

> This would also allow up to 1Gbps SMB data transfers between the
> machines, although I suppose a single TS can consume 100% of the DC
> bandwidth, but I think this is not unusual, and should work OK if
> another TS wants to do some small transfer at the same time.

Already addressed.  Even with only 2 bonded ports on the DC, the most
bandwidth a single TS box can tie up is half.  And if you implement
port-level rate limiting of 500Mb/s (in/out) on each of the 4 TS
boxen's switch ports, you can never flood the DC.

> So, purchase list becomes:
> 1 x 4port ethernet card $450 each
> 7 x 2port ethernet card $161 each
> 1 x 48 port switch (any suggestions?) $600
> 
> Total Cost: $2177

Again, you have all the network hardware you need, so this is completely
unnecessary.  You just need to get what you have configured correctly.

>> I hope I've provided helpful information.
> 
> Definitely...

Everything above should be even more helpful.  My apologies for not
having precise LACP insight in my previous post.  It's been quite a
while, I was rusty, and I didn't have time to refresh my knowledge
before writing it.

> Just in case the budget dollars doesn't stretch that far, would it be a
> reasonable budget option to do this:
> Add 1 x 2port ethernet card to the DC machine
> Add 7 x 1port ethernet card to the rest of the machines $32 (Intel Pro
> 1000GT DT Adapter I 82541PI Low-Profile PCI)
> Add 1 x 24port switch $300
> 
> Total Cost: $685

If the DC can take a PCIe x4 dual port card, that should work fine with
the reconfiguration I described above.  The rest of the gear in that
$685 is wasted--no gain.  Use part of the remaining balance for the LSI
9207-8i HBA.  That will make a big difference in throughput once you
get alignment and the other issues identified and corrected: more than
double your current bandwidth and IOPS, making full-time DRBD possible.

> I'm assuming this would stop sharing SMB/iSCSI on the same ports, and
> improve the ability for the TS machines to at least talk to the DC and
> know the IO is "in progress" and hence reduce the data loss/failures?

Again, this is all unnecessary once you implement the aforementioned
changes.  If the IO errors on the TS machines still occur, the cause
isn't in the network setup.  Running CIFS(SMB) and iSCSI on the same
port is done 24x7 by thousands of sites.  This isn't the cause of the
TS IO errors.  Congestion alone shouldn't cause them either, unless a
Windows kernel iSCSI packet timeout is being exceeded or something like
that, which actually seems pretty plausible given the information
you've provided.  I admit I'm not a Windows iSCSI expert.  If that is
the case, then it should be solved by the mentioned LACP configuration
and two bonded ports on the DC box.

>> Keep us posted.
> 
> Will do, I'll have to price up the above options, and get approval for
> purchase, and then will take a few days to get it all in place/etc...

Given the temperature under the collar of the client, I'd simply spend
on adding the 2 bonded ports to the DC box, make all of the LACP
changes, and straighten out the alignment and other issues on the SSDs,
md stripe cache, etc.  This will yield substantial gains.  Once the
client sees the positive results, then recommend the HBA for even
better performance.  Remember, Intel's 520 SSD data shows nearly double
the performance on SATA3 vs SATA2.  Once you have alignment and md
tuning squared away, moving to the LSI should nearly double your block
throughput.
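
For the md stripe cache piece, a starting point would be something like
this (a sketch; the md device name and the 8192 figure are just example
values -- stripe_cache_size only exists for RAID 4/5/6 arrays and costs
RAM, roughly pages x 4KB x number of member devices, so test rather
than copy blindly):

  cat /sys/block/md0/md/stripe_cache_size     # default is 256
  echo 8192 > /sys/block/md0/md/stripe_cache_size

  # and for the alignment check against the chunk size:
  mdadm --detail /dev/md0 | grep -i "chunk size"
  parted /dev/sda align-check optimal 1       # repeat per partition/drive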

> Thank you very much for all the very useful assistance.

You're very welcome, Adam.  Note my email domain. ;)  I love this stuff.

-- 
Stan


