200ms retransmits, detecting very short network spikes

Hi list,

The problem in short: in our network we see TCP retransmits that, in my
opinion, shouldn't be there. This happens in the following setup.

- Cabinets hold up to 20 servers connected to a 24-port switch
(all 1 Gbit UTP links).
- These cabinet switches are connected to our routers via 1 Gbit UTP.
- Cabinets don't exceed 400 Mbit/s on the uplink port (measured over a
10 s interval).
- Cabinets contain the following: webservers + PHP, MySQL, static HTTP
content, memcached and some other small services.
- All servers run Linux with 2.6 kernels.
- The current switches have the following specs: switch fabric capacity
48.0 Gbps, forwarding rate 35.6 Mpps.
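The fabric and forwarding numbers above say nothing about per-port buffering, which is what matters when many servers burst toward one uplink at once. A back-of-the-envelope sketch (in Python; the 768 KB buffer size is an assumption for illustration, the real figure would come from the switch datasheet):

```python
# Back-of-the-envelope: how long can the uplink buffer absorb an incast burst?
# The buffer size below is an assumed value, NOT from our switch's datasheet.

UPLINK_BPS = 1_000_000_000          # 1 Gbit/s uplink drain rate
BUFFER_BYTES = 768 * 1024           # assumed shared buffer behind the uplink

def incast_rate_bps(n_senders, per_sender_bps=1_000_000_000):
    """Aggregate arrival rate when n servers burst at line rate at once."""
    return n_senders * per_sender_bps

def time_to_overflow_ms(n_senders):
    """Milliseconds until the buffer fills: buffer bits / excess arrival rate."""
    excess_bps = incast_rate_bps(n_senders) - UPLINK_BPS
    if excess_bps <= 0:
        return float("inf")         # uplink keeps up, no overflow
    return BUFFER_BYTES * 8 / excess_bps * 1000

# With only 2 servers bursting at line rate simultaneously, the assumed
# buffer overflows in about 6 ms:
print(round(time_to_overflow_ms(2), 1))   # -> 6.3
```

So even with a cabinet averaging well under 400 Mbit/s, a few milliseconds of simultaneous bursting is enough to drop packets, which matches the microburst theory.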

We first noticed packet loss when we introduced memcached multigets. We
saw that those pages sometimes rendered 200 ms+ slower than normal;
looking deeper into that problem we saw it was caused by TCP
retransmits that took 200 ms.
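The 200 ms figure is no coincidence: Linux clamps the TCP retransmission timeout from below at TCP_RTO_MIN (200 ms), so on a sub-millisecond LAN RTT a single lost packet stalls the connection for the full 200 ms. A small illustrative sketch (Python arithmetic, not kernel code):

```python
# Why one lost packet costs ~200 ms: Linux bounds the retransmission timer
# from below at TCP_RTO_MIN (200 ms). On a LAN, the RTT-derived RTO
# (RFC 6298 style: SRTT + 4 * RTTVAR) would be far lower, so the clamp wins.

TCP_RTO_MIN_MS = 200   # Linux lower bound on the retransmission timer

def effective_rto_ms(srtt_ms, rttvar_ms):
    """RTO = SRTT + 4*RTTVAR, clamped at the 200 ms minimum."""
    return max(srtt_ms + 4 * rttvar_ms, TCP_RTO_MIN_MS)

# Intra-cabinet RTTs are well under a millisecond, so any drop stalls the
# connection for the full minimum:
print(effective_rto_ms(0.2, 0.1))   # -> 200
```

That's why a single dropped packet inside a multiget shows up as exactly a ~200 ms page-render penalty.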

We then wrote a simple client/server application that could
reproduce the 200 ms timeout in those cabinets (the servers in these
cabinets were still serving live traffic meanwhile). Debugging this
further we noticed that the retransmits never happen below 150 Mbit/s
of usage on the cabinet uplink. After we enabled flow control on both the
router and the cabinet switch (all ports) things looked a lot better,
but that didn't solve the problem completely. Then we upgraded the 1 Gbit
uplink to a 2 Gbit trunk as a test, which improved the situation even more.

We already tested the following, without improvement:

- move from UTP to fiber
- move from Cat5e to Cat6e
- upgrade to switches with more switching capacity and more backplane
buffers

Because it's madness to upgrade to 2 Gbit links when a cabinet only does
150 to 400 Mbit/s of sustained traffic, we looked for a way to detect
network spikes instead.

We started using libpcap to calculate the bandwidth over a span of 100
received packets. We built a bridge that could be placed between the
uplink and the switch and ran the app there. This reported spikes of up
to 1.5 Gbit/s, which is impossible on a 1 Gbit link; of course this
happens because the libpcap app runs in userspace and cannot see the
time spent buffering in the network hardware or the kernel. Calculating
Mbit/s over a ~10 ms window with wrong timings makes a very big
difference. When we move to a 100,000-packet interval, the calculation
spans up to 300 ms and no longer shows the sharp spikes.
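For illustration, the per-packet-window calculation looks roughly like this (a Python sketch of the measurement; the real app uses libpcap). It also demonstrates the artifact: a batch of buffered packets delivered with compressed timestamps computes to well over line rate.

```python
# Sketch of the measurement the libpcap app does: Mbit/s over a sliding
# window of the last N captured packets. With N=100 the window can span well
# under a millisecond, so timestamp distortion from kernel/NIC buffering
# inflates the computed rate past what the wire can carry.
from collections import deque

def window_rates_mbps(packets, n=100):
    """packets: iterable of (timestamp_sec, length_bytes) as pcap hands them.
    Returns one Mbit/s reading per full window."""
    window = deque(maxlen=n)
    rates = []
    for ts, length in packets:
        window.append((ts, length))
        if len(window) == n:
            span = window[-1][0] - window[0][0]   # seconds covered by window
            if span > 0:
                bits = sum(l for _, l in window) * 8
                rates.append(bits / span / 1e6)
    return rates

# 100 full-size frames whose timestamps were compressed into 0.5 ms by
# buffering read as ~2.4 Gbit/s on a 1 Gbit wire:
burst = [(i * 0.5e-3 / 99, 1514) for i in range(100)]
print(round(max(window_rates_mbps(burst, 100))))   # -> 2422
```

This is exactly the 1.5 Gbit/s-on-a-1 Gbit-link effect: the arithmetic is fine, but the userspace timestamps don't reflect when the bytes actually crossed the wire.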

With the bridge between the uplink and the switch we also noticed that
when a retransmit occurred, the original packet was lost in the switch:
it came in from one of the hosts on the cabinet switch but never reached
the uplink to the router.

Does anybody know a reliable way to monitor bandwidth over very short
intervals, like 10 ms? Tools like iptraf or SNMP polling of the hardware
aren't accurate enough to detect the microbursts we suspect are there.
Or does anybody recognize the problem and have tips to prevent the
200 ms retransmits? We're not looking for kernel "patches" like
lowering the RTO.

Our goal is to prove that we have very short spikes on our network that
exceed our 1Gbit link capacity.
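A sketch of what such a short-interval measurement could look like (illustrative Python; the fixed wall-clock bucketing, and the 90% alert threshold, are assumptions, not a finished tool): bucket captured packets into fixed 10 ms intervals rather than counting over a span of packets, so one reading can't be stretched or compressed by delivery batching.

```python
# Bucket packets into fixed 10 ms wall-clock intervals instead of counting
# over a packet span. A bucket near the link's byte capacity for its interval
# is a microburst candidate. The 90% threshold is an assumed alert level.

def microburst_buckets(packets, bucket_sec=0.010, link_mbps=1000, frac=0.9):
    """packets: iterable of (timestamp_sec, length_bytes).
    Returns {bucket_index: total_bytes} for buckets above frac of line rate."""
    totals = {}
    for ts, length in packets:
        idx = int(ts / bucket_sec)
        totals[idx] = totals.get(idx, 0) + length
    # bytes one bucket can carry at line rate, scaled by the alert threshold
    limit = link_mbps * 1e6 / 8 * bucket_sec * frac
    return {idx: b for idx, b in totals.items() if b > limit}
```

Timestamp accuracy is still the hard part; hardware timestamping on the capture NIC (or port mirroring into a dedicated capture box) would be needed for the buckets to be trustworthy.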

Regards,

Marlon de Boer
System engineer http://www.hyves.nl
--
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
