Hi,

You should send this to netdev@xxxxxxxxxxxxxxx, where all the Linux
network developers hang out.

-Bill

On Mon, 26 Jan 2009, Tobias Klausmann wrote:

> Hi,
>
> it seems I've stumbled across a bug in the way Netfilter handles
> packets. I have only been able to reproduce this with UDP, but it
> might also affect other IP protocols. This first bit me when
> updating from glibc 2.7 to 2.9.
>
> Suppose a program calls getaddrinfo() to find the address of a
> given hostname. Usually, the glibc resolver asks the name server
> for both the A and AAAA records, gets two answers (addresses or
> NXDOMAIN) and happily continues on. What is new with glibc 2.9 is
> that it doesn't serialize the two requests the way 2.7 did. The
> older version asks for the A record, waits for the answer, asks
> for the AAAA record, then waits for that answer. The newer lib
> fires off both requests in quick succession (usually 5-20
> microseconds apart on the systems I tested with). Not only that,
> it also uses the same socket fd (and thus source port) for both
> requests.
>
> Now if those packets traverse a Netfilter firewall, in the
> glibc-2.7 case they will create two conntrack entries, allowing
> the answers back[0], and everything is peachy. In the glibc-2.9
> case, the second packet sometimes gets lost[1]. After eliminating
> other causes (buggy checksum offloading, packet loss, a busy
> firewall and/or DNS server, and a host of others), I'm sure it's
> lost inside the firewall's Netfilter code.
>
> Using counting-only rules and building a dedicated setup with a
> minimal Netfilter rule set, we could watch the counters, finding
> two interesting facts for the failing case:
>
> - The count in the NAT pre/postrouting chains is higher than for
>   the case where the requests work. This points to the second
>   packet being counted although it's part of the same connection
>   as the first.
>
> - All other counters increase, up to and including
>   mangle/POSTROUTING.
>
> In essence, if you have N tries and one of them fails, you have
> 2N packets counted everywhere except the NAT chains, where it's
> N+1.
>
> Since neither QoS nor tunneling is involved, the second packet
> appears to be dropped by Netfilter or the NIC's code. Since we
> see this behaviour on varying hardware, I'm rather sure it's the
> former.
>
> The working hypothesis of what happens is this:
>
> - The first packet enters the Netfilter code, triggering a check
>   for a conntrack entry relevant to it. Since there is no entry,
>   the packet creates a new conntrack that isn't yet in the global
>   hash of conntrack entries. Since the chains could modify the
>   packet's relevant info, the entry cannot be added to the hash
>   then and there (aka an unconfirmed conntrack).
>
> - The second packet enters the Netfilter code. Again, no
>   conntrack entry is relevant, since the first packet has not yet
>   reached the point where its conntrack would have been added to
>   the global hash, so the second packet gets an unconfirmed
>   conntrack, too.
>
> - The first packet reaches the point where the conntrack entry is
>   added to the global hash.
>
> - The second packet reaches the same point, but since it has the
>   same src/sport-dst/dport-proto tuple, its conntrack clashes
>   with the existing entry and both (packet and entry) are
>   discarded.
>
> Since the timing is very critical here, this only happens if an
> application (such as the glibc 2.9 resolver) fires two packets
> rapidly *and* those have the same 5-tuple *and* they are
> processed in parallel (e.g. on a multicore machine).
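A minimal sketch of such a quick-fire reproducer, mimicking the
glibc 2.9 behaviour described above: two sendto() calls on one UDP
socket with no wait in between. The name server address 192.0.2.1 is
a placeholder (point it at a resolver behind the firewall under
test), and the payload is a stand-in rather than a real DNS query:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in dst;
	const char payload[] = "probe";	/* stand-in for a DNS query */
	int fd, i;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(53);
	dst.sin_addr.s_addr = inet_addr("192.0.2.1");	/* placeholder */

	/*
	 * Two sends back to back on one socket, no recv in between:
	 * both packets share the src/sport-dst/dport-proto tuple and
	 * can race through conntrack before either entry is confirmed.
	 */
	for (i = 0; i < 2; i++)
		if (sendto(fd, payload, sizeof(payload), 0,
			   (struct sockaddr *)&dst, sizeof(dst)) < 0)
			perror("sendto");

	close(fd);
	return 0;
}

Run in a loop while watching the counting rules, this should
reproduce the N+1 pattern in the NAT chains described above.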
> Another observation is that this happens much less often with
> some kernels. While on one it can be triggered in about 50% of
> cases, on another you can go through 20k rounds of two packets
> before the bug is triggered. Note, however, that the
> probabilities vary wildly: I've seen the program break within the
> first 100 packets a dozen times in a row and later not break for
> 50k tries in a row on the same kernel.
>
> Since glibc 2.7 uses different ports and waits for answers, it
> doesn't trigger this race. I guess there are very few
> applications whose normal operation quick-fires the first two UDP
> packets in this manner. As a result, this has gone unnoticed for
> quite a while - and even when it happens, it may look like a
> fluke.
>
> When looking at the conntrack stats, we also see that
> insert_failed in /proc/net/stat/nf_conntrack does indeed increase
> when the routing of the second packet fails.
>
> The kernels used on the firewall (all vanilla versions):
>
> 2.6.25.16
> 2.4.19pre1
> 2.6.28.1
>
> All of them show this behaviour. On the clients, we only have
> 2.6-series kernels, but I doubt they influence this scenario
> (much).
>
> Regards,
> Tobias
>
> [0] In the usual setup
> [1] Sometimes. Not always. Read on for probabilities.
>
> PS: I'm not subscribed to the list, so please CC me where
> appropriate. Thanks.
> --
> Save a tree - disband an ISO working group today.
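For completeness, a sketch for watching the insert_failed counter
mentioned above, assuming the usual /proc/net/stat/nf_conntrack
layout: one header line naming the columns, then one line of hex
counters per CPU. The column is located by name rather than by
position, since the layout can vary between kernel versions:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/net/stat/nf_conntrack", "r");
	char line[1024];
	char *tok;
	int i, col = -1;
	unsigned long total = 0;

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* Header line: find the insert_failed column by name. */
	if (fgets(line, sizeof(line), f)) {
		for (i = 0, tok = strtok(line, " \t\n"); tok;
		     i++, tok = strtok(NULL, " \t\n")) {
			if (strcmp(tok, "insert_failed") == 0) {
				col = i;
				break;
			}
		}
	}
	if (col < 0) {
		fprintf(stderr, "insert_failed column not found\n");
		fclose(f);
		return 1;
	}

	/* One line of hex counters per CPU: sum that column. */
	while (fgets(line, sizeof(line), f)) {
		for (i = 0, tok = strtok(line, " \t\n"); tok;
		     i++, tok = strtok(NULL, " \t\n")) {
			if (i == col) {
				total += strtoul(tok, NULL, 16);
				break;
			}
		}
	}
	fclose(f);

	printf("insert_failed: %lu\n", total);
	return 0;
}

A run of the reproducer that loses its second packet should bump
this total by one, matching the observation above.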