Re: Conntrack not matching properly - producing serious outages

"John A. Sullivan III" <jsullivan@xxxxxxxxxxxxxxxxxxx> · Thu, 11 Aug 2011 15:14:02 -0400

On Thu, 2011-08-11 at 14:26 +0200, Jozsef Kadlecsik wrote:
> On Thu, 11 Aug 2011, John A. Sullivan III wrote:
> 
> > On Thu, 2011-08-11 at 12:12 +0200, Jozsef Kadlecsik wrote:
> > > 
> > > On Thu, 11 Aug 2011, John A. Sullivan III wrote:
> > > 
> > > > Hello, all.  We have been having a subtle problem with conntrack for
> > > > quite a long time but it has suddenly gotten much worse.  Packets are
> > > > being matched as INVALID when we would expect them to be ESTABLISHED.
> > > > We are running on kernel 2.6.30.5 on X86_64 with CentOS 5.4 and
> > > > iptables-1.3.5-5.3.el5_4.1.  This has escalated from a minor annoyance
> > > > that we were going to investigate to provoking serious outages and all
> > > > hands to the pump.
> > > > 
> > > > The conntrack table is not swamped although we did increase the max
> > > > count and the hashsize just in case to no avail:
> > > > [root@fw01 netfilter]# cat ip_conntrack_max
> > > > 65536
> > > > [root@fw01 netfilter]# cat ip_conntrack_count
> > > > 532
> > > > 
> > > > Here are three specific examples.  The first is from the FORWARD chain.
> > > > Here are the logging messages:
> > > >  
> > > > Aug 11 03:29:19 fw01 kernel: FORWARD INVALID IN=bond1 OUT=bond4
> > > > SRC=172.x.y.73 DST=172.x.z.34 LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=32940
> > > > DF PROTO=TCP SPT=8080 DPT=52999 WINDOW=34 RES=0x00 ACK FIN URGP=0
> > > 
> > > Those are, with high probability, late FIN packets: the corresponding 
> > > conntrack entry has already been deleted, so conntrack cannot find the 
> > > matching stream and therefore marks the packet as INVALID.
> > Thank you very much, Jozsef.  That would explain why we did not
> > categorize this as a high priority in the past as it seemed to have
> > minimal impact.  I would guess we do not need to be concerned about
> > these.
> > 
> > However, the other two are much more problematic and are what escalated
> > this into a crisis.  As I just explained in another reply, these are
> > happening in the middle of activity, i.e., they are NX remote desktop
> > sessions being carried via SSH.  The users are in the middle of typing
> > or scrolling through their desktops, in other words, the connection is
> > definitely active and passing many packets.  Then, without warning,
> > their desktops freeze, the connection eventually times out, and we see
> > these INVALID and dropped packets.  That's the one we really need to
> > solve.
> 
> That might be related to SACK option handling: some "clever" devices love 
> to mangle TCP SEQ/ACK values but forget to adjust the SACK options 
> accordingly. Try disabling SACK support on both communicating endpoints. 
> If the problem disappears, then it's a SACK issue.
> 
<snip>
Alas, it is not SACK.  We disabled SACK and DSACK on both endpoints for
one user, and his session still locked up within a few seconds.
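
For anyone repeating this test, here is a minimal sketch of how the SACK
settings could be checked on a Linux endpoint.  It assumes the standard
net.ipv4.tcp_sack and net.ipv4.tcp_dsack sysctls; show_sack is a
hypothetical helper name, and the exact commands used in our test are not
recorded in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print the current SACK/DSACK settings.
# The optional argument is the sysctl root (defaults to /proc/sys).
show_sack() {
    root="${1:-/proc/sys}"
    for knob in tcp_sack tcp_dsack; do
        f="$root/net/ipv4/$knob"
        if [ -r "$f" ]; then
            # 1 = enabled, 0 = disabled
            printf '%s=%s\n' "$knob" "$(cat "$f")"
        else
            printf '%s=unavailable\n' "$knob"
        fi
    done
}

show_sack

# To disable both for the duration of a test (as root; the change does
# not persist across reboots):
#   sysctl -w net.ipv4.tcp_sack=0
#   sysctl -w net.ipv4.tcp_dsack=0
```

The settings affect only connections opened after the change, so the
session under test has to be re-established on both ends before drawing
any conclusion.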

Where do we look next? Thanks - John
