On Thu, 2011-08-11 at 14:26 +0200, Jozsef Kadlecsik wrote:
> On Thu, 11 Aug 2011, John A. Sullivan III wrote:
>
> > On Thu, 2011-08-11 at 12:12 +0200, Jozsef Kadlecsik wrote:
> > >
> > > On Thu, 11 Aug 2011, John A. Sullivan III wrote:
> > >
> > > > Hello, all. We have been having a subtle problem with conntrack for
> > > > quite a long time but it has suddenly gotten much worse. Packets are
> > > > being matched as INVALID when we would expect them to be ESTABLISHED.
> > > > We are running on kernel 2.6.30.5 on X86_64 with CentOS 5.4 and
> > > > iptables-1.3.5-5.3.el5_4.1. This has escalated from a minor annoyance
> > > > that we were going to investigate to provoking serious outages and all
> > > > hands to the pump.
> > > >
> > > > The conntrack table is not swamped although we did increase the max
> > > > count and the hashsize just in case to no avail:
> > > > [root@fw01 netfilter]# cat ip_conntrack_max
> > > > 65536
> > > > [root@fw01 netfilter]# cat ip_conntrack_count
> > > > 532
> > > >
> > > > Here are three specific examples. The first is from the FORWARD chain.
> > > > Here are the logging messages:
> > > >
> > > > Aug 11 03:29:19 fw01 kernel: FORWARD INVALID IN=bond1 OUT=bond4
> > > > SRC=172.x.y.73 DST=172.x.z.34 LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=32940
> > > > DF PROTO=TCP SPT=8080 DPT=52999 WINDOW=34 RES=0x00 ACK FIN URGP=0
> > >
> > > Those are, with high probability, late FIN packets: the belonging conntrack
> > > entry has already been deleted and thus conntrack cannot find the matching
> > > stream, therefore it marks them as INVALID.
> > Thank you very much, Jozsef. That would explain why we did not
> > categorize this as a high priority in the past as it seemed to have
> > minimal impact. I would guess we do not need to be concerned about
> > these.
> >
> > However, the other two are much more problematic and what escalated this
> > into a crisis. As I just explained in another reply, these are
> > happening in the middle of activity, i.e., they are NX remote desktop
> > sessions being carried via SSH. The users are in the middle of typing
> > or scrolling through their desktops, in other words, the connection is
> > definitely active and passing many packets. Then, without warning,
> > their desktops freeze, the connection eventually times out, and we see
> > these INVALID and dropped packets. That's the one we really need to
> > solve.
>
> That might be related to SACK option handling: some "clever" devices love
> to mangle TCP SEQ/ACK values, but forget about the SACK options. Try to
> disable SACK support on both communicating endpoints. If the problem
> disappears, then it's a SACK issue.
> <snip>
Alas, it is not SACK. We disabled sack and dsack on both sides of one
user and it still took all of a few seconds for him to lock up. Where do
we look next? Thanks - John