Re: clogging qdisc

Linux Advanced Routing and Traffic Control



On 12/27/18 10:15 AM, Grzegorz Gwóźdź wrote:
> This solution worked for a few years in several networks, but in one network the mechanism has been clogging during peak hours for a few weeks now.

Okay. It sounds to me like the methodology works well enough. But it might have scaling problems.

> Pings to all local hosts grow to hundreds of ms (even to hosts without any traffic) and throughput drops.


> The only solution is:
> tc qdisc del root dev eth0

That doesn't seem like a solution.  Maybe a workaround, if you're lucky.

> If I immediately add the rules again, the problem immediately starts again too.

That sounds like the workaround doesn't even work.

> But after some time, even though traffic is higher, I can load the queues and everything works until the next attack.

I'm thinking that "attack" might be the proper word.

I'm wondering if this is a number-of-packets-per-second issue rather than a bytes-per-second issue.

Specifically, if the "attack" consists of considerably more, smaller packets than normal. I'm guessing normal traffic is fewer but bigger packets.
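Some back-of-the-envelope arithmetic shows why that distinction matters. The packet sizes and rate below are illustrative, not measurements from this network:

```python
# Same bit rate, very different packet rates depending on packet size.

def packets_per_second(rate_bps: float, packet_bytes: int) -> float:
    """Packets per second needed to sustain rate_bps with fixed-size packets."""
    return rate_bps / (packet_bytes * 8)

RATE = 1e9  # 1 Gbps, roughly the load described in the original post

# Near-MTU packets, typical of bulk transfers:
print(f"1500-byte packets: {packets_per_second(RATE, 1500):,.0f} pps")
# Small packets, typical of floods:
print(f"64-byte packets:   {packets_per_second(RATE, 64):,.0f} pps")
```

Roughly a 23x difference in per-packet work for the same amount of data, which is exactly the kind of load shift that can clobber a box that handles the "normal" pattern fine.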

Take a packet capture during normal traffic periods and a separate packet capture during attack traffic periods.

Then open each of the captures in Wireshark and pull up the Packet Lengths report from the Statistics menu. I'm guessing that you will see a significant difference between the two captures.
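If you'd rather script it than click through Wireshark, here's a rough sketch of what that report computes. The bucket edges mirror Wireshark's defaults (truncated to the common range) and the sample lengths are invented:

```python
from collections import Counter

# Bucket packet sizes the way Wireshark's Packet Lengths report does,
# so the two captures can be compared side by side.
BUCKETS = [(0, 19), (20, 39), (40, 79), (80, 159), (160, 319),
           (320, 639), (640, 1279), (1280, 2559)]

def length_histogram(lengths):
    counts = Counter()
    for n in lengths:
        for lo, hi in BUCKETS:
            if lo <= n <= hi:
                counts[(lo, hi)] += 1
                break
    return counts

normal = [1500, 1500, 1400, 66, 1500]   # fewer, bigger packets (made up)
attack = [64, 64, 66, 64, 70, 64, 64]   # many small packets (made up)

for name, lengths in (("normal", normal), ("attack", attack)):
    hist = length_histogram(lengths)
    print(name, {f"{lo}-{hi}": c for (lo, hi), c in sorted(hist.items())})
```

Feed it the frame lengths exported from each capture (e.g. via tshark) and look at where the mass of each distribution sits.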

> I don't think it is a hardware issue, because this system works in an LXC container, and in another container on the same NIC (doing the same work for other clients) everything works fine.

I think containers and VMs are good for some things. I don't think that they (more specifically, their overhead) are good for high-throughput traffic, particularly large numbers of small packets per second (PPS). High PPS with small packets requires quite a bit of optimization, and I think it's actually rare outside of specific situations. What I've seen more frequently is fewer (by one or more orders of magnitude) packets that are larger (by one or more orders of magnitude). Overall the amount of data is roughly the same, but /how/ it's moved can cause considerable load on equipment. Especially equipment that is not optimized for high PPS, much less carrying additional overhead like containers or VMs.

> Load on the system is low, there is no hardware problem, the whole hardware has been replaced, and on the new hardware I've installed a new system (Ubuntu 18.04).
> No dropped packets in interface statistics. dmesg is clean.

What messages were you seeing in dmesg before?

> As a result the conntrack table grows until it overflows (if I don't delete the qdisc).
> I even sniffed all the traffic and tried to analyze it, but it's hard since it's over 1 Gbps (on a 10 Gb interface).

The connection tracking table overflowing tells me one of two things: either you are truly dealing with a high-PPS condition, or you don't have enough memory in the system and the size of the conntrack table is restricted.

I once took a system that comfortably ran with ~512 MB of memory up to 4 GB to allow the conntrack table to be large enough for what the system was doing. (I think the conntrack table was a fixed percentage of memory in that kernel. Maybe it's tunable now.)
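For what it's worth, on current kernels it is tunable via sysctl. A sketch of the main knob (the value here is a placeholder, not a recommendation; size it to your own memory and connection rate):

```
# /etc/sysctl.d/99-conntrack.conf -- placeholder value only
net.netfilter.nf_conntrack_max = 1048576
```

You can compare current usage against the limit at runtime with `sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max`. A table that is routinely near its max during the "attack" windows would support the high-PPS theory.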

> What can I check?

Check the Packet Lengths report as suggested above.

If your problem is indeed high PPS, you might also be having problems outside of your Linux machine. It's quite possible that there is enough traffic / PPS that things like switches and/or wireless access points are also being negatively affected. It's possible that they are your choke point, not the actual Linux system.

> Where to look for a cause?

I think you need to get a good understanding of what your two typical traffic patterns are: normal and attack. That includes determining whether this is legitimate traffic, or whether someone is conducting an attack and the network is buckling under the stress.

You might also consider changing out network cards. I've been around people that like to pooh-pooh some Realtek cards and other non-Intel / non-Broadcom NICs. Admittedly, some of the better NICs have more CPU / memory / I/O on them to handle more traffic.

You might want to evaluate your container network configuration. I've read in a few places that there are performance differences between Linux native bridges, OvS, and {MAC,IP}VLAN. It's possible that, based on sheer numbers, the difference in performance is adding up and starting to cause problems under the higher traffic load.

Unfortunately, I think you have more investigating to do to be able to identify what the problem might be. Simply trading out hardware is likely to be expensive and come with a lot of annoyances (or worse).

Grant. . . .
unix || die

