On man, 2018-01-22 at 15:54 +0100, Anders K. Pedersen | Cohaesio wrote: > On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote: > > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio > > <akp@xxxxxxxxxxxx> wrote: > > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio > > > wrote: > > > > Hello, > > > > > > > > On one of our border routers, Netfilter is occasionally logging > > > > packets > > > > with "OUT=lo" (output interface lo) even though the packet should > > > > be > > > > going out via a regular interface. This behavior is present on > > > > Linux > > > > 4.13.0 to 4.14.9, and a bisection of the problem points to > > > > > > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call > > > > dst_dev_put() properly > > > > > > > > as the first bad commit. This commit adds dst_dev_put() calls > > > > before > > > > some dst_release() calls, and dst_dev_put() does > > > > > > > > dst->dev = dev_net(dst->dev)->loopback_dev; > > > > > > > > (among other things), which fits the problem we're seeing. > > > > > > > > The essential part of our nftables rule set that shows this > > > > behavior is > > > > > > > > chain forward { > > > > type filter hook forward priority 0; > > > > > > > > meta oif { $internal_interfaces } accept > > > > > > > > meta oif lo ip daddr != 127.0.0.0/8 \ > > > > log group 0 snaplen 80 prefix "oif-lo" > > > > counter > > > > > > > > ip saddr { $our_ip_series } \ > > > > flow table acct_out \ > > > > { meta oif . rt nexthop . ip saddr > > > > timeout 12m counter } \ > > > > accept > > > > > > > > log group 0 snaplen 80 prefix "DROP" counter drop > > > > } > > > > > > > > The router only does stateless packet filtering and no > > > > redirection or > > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are > > > > not > > > > even compiled for this kernel). > > > > > > > > As a result of this problem we see packets that should be going > > > > to an > > > > internal interface (and thus accepted by the first rule above) > > > > being > > > > logged and dropped by the last rule. Some examples: > > > > > > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 > > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF > > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 > > > > WINDOW=3295 ACK URGP=0 MARK=0 > > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 > > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF > > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 > > > > WINDOW=3295 ACK URGP=0 MARK=0 > > > > > > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF > > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF > > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > > > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF > > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF > > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > > > > > It also happens for outbound traffic, where the packets are > > > > logged and > > > > counted in the acct_out flow table with "meta oif" = "lo", but a > > > > correct "rt nexthop" - an example: > > > > > > > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo > > > > MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 > > > > DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF > > > > PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 > > > > WINDOW=12618 ACK URGP=0 MARK=0 > > > > > > > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo > > > > flow table acct_out { > > > > "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter > > > > packets 1 bytes 40 > > > > > > > > I don't know if these packets are actually sent out on the > > > > correct > > > > outbound interface thanks to the proper nexthop (the MAC= > > > > information > > > > in the Netfilter log is from the received packet and thus not > > > > useful > > > > here). > > > > > > > > I tried running a tcpdump on the lo interface to see if these > > > > packets > > > > would show up there, but during the three days I had it running, > > > > it > > > > only logged one such packet, while Netfilter logs 20+ outbound > > > > packets > > > > every day, and the one packet logged by tcpdump was *not* logged > > > > by > > > > Netfilter. > > > > > > Further testing of the individual parts of the first bad commit > > > shows > > > that the five first additions of the dst_dev_put() call doesn't > > > trigger > > > the problem, while the last one does (also without the first five), > > > so > > > the problematic part is: > > > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > > index 3dee004..d986d80 100644 > > > --- a/net/ipv4/route.c > > > +++ b/net/ipv4/route.c > > > @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh, > > > struct rtable *rt) > > > prev = cmpxchg(p, orig, rt); > > > if (prev == orig) { > > > if (orig) { > > > + dst_dev_put(&orig->dst); > > > dst_release(&orig->dst); > > > rt_free(orig); > > > } > > > > > > > Hi Anders, > > > > First of all, the bad commit you pointed to is part of the effort to > > remove dst garbage collector. And it is expected that certain routing > > behavior will change after this whole patch set. > > In your specific case: > > Specifically for the one line change you pointed above, > > rt_cache_route() is called to replace an existing cached route with a > > new one because the existing cached route is already invalid. And the > > criteria to determine if it is valid is in rt_cache_valid(). If a > > route becomes invalid, it means the route is either being deleted, or > > a new route is being inserted to the same table, or there is certain > > netdev event happening that could make this route unusable. > > Hi Wei, > > I looked into this and activated debug logging for all route changes > from the Quagga/zebra routing daemons, and none of the logged packets > with OUT=lo happened within a minute of a route change for a prefix > that covered the destination IP in the packet. > > Also, some of the logged packets with OUT=lo are for destination IPs > that are on the subnet of a directly connected interface on the router, > so there are no dynamic route changes and there haven't been any link > changes either. > > Which other "certain netdev event" could cause this? > > > In all > > cases, a route re-lookup is required. dst_dev_put() here does cleanup > > work to free up the device refcount and all packets using this route > > will be dropped. And as you said, you don't really see the packets > > going out of lo. > > I wrote that we don't see the same packets being logged *and* show up > in a "tcpdump -i lo", but we do see packets in the tcpdump that should > have been routed out a regular interface - it's just not the same > packets that netfilter is logging (and it happens less frequently). It > looks like it depends on how far the packet is in the path through the > kernel, when dst_dev_put() is called. I spent some more time testing this in order to understand the scenario that causes this. It turns out that the packets being dropped due to OUT=lo only happen for the __mkroute_input() -> rt_set_nexthop() -> rt_cache_route() -> dst_dev_put() call chain, when rt_cache_valid() returns false due to rt_is_expired(), which compares rth->rt_genid with rt_genid_ipv4(dev_net(rth->dst.dev)). The latter is bumped every time *any* fib route is changed (and in other cases), which thus invalidates everything in the route cache. On a router with full BGP feeds that happens quite frequently, and I believe the scenario we're seeing is: - Packet A received for destination X is assigned rth R - Some route in the FIB is added, deleted, or changed, so rt_genid is bumped - Packet B received for destination X before packet A has been forwarded - __mkroute_input() -> rt_cache_valid() -> rt_is_expired() determines that rth R is no longer valid - A new rth S is created and rth R is invalidated via __mkroute_input() -> rt_set_nexthop() -> rt_cache_route() -> dst_dev_put() - Packet A now has rth R with dst->dev = dst->output = dst_discard_out and dev_net(dst->dev)->loopback_dev as it moves through the kernel forwarding (and Netfilter) logic, so it gets dropped To avoid this I developed the patch below, which updates rth->rt_genid to the current value of rt_genid_ipv4(dev_net(rth->dst.dev)), if the information in it is the same as what would be put into a new rth. This avoids invalidating the old rth and allocating a new one with the same information, and packet A from the scenario above is no longer dropped. The patch has a lot of net_warn_ratelimited() for debugging, but none of them have been triggered in my testing. I also revived and abused the in_hit counter to validate that the code path is being hit. After a few days with the patch there haven't been any more packet drops, and we have the following statistics: # rtstat -k entries,in_hit,in_slow_tot,in_slow_mc,in_no_route rt_cache|rt_cache|rt_cache|rt_cache|rt_cache| entries| in_hit|in_slow_|in_slow_|in_no_ro| | | tot| mc| ute| 1015| 2305460| 1177662| 758764| 1080447| Would a cleaned up (i.e. without the warnings and counter) version of this patch be appropriate for submission, or are there cases where this would break things? --- linux-4.14.18/include/net/route.h.orig 2017-11-12 19:46:13.000000000 +0100 +++ linux-4.14.18/include/net/route.h 2018-02-11 11:20:34.133788420 +0100 @@ -96,6 +96,7 @@ }; struct rt_cache_stat { + unsigned int in_hit; unsigned int in_slow_tot; unsigned int in_slow_mc; unsigned int in_no_route; --- linux-4.14.18/net/ipv4/route.c.orig 2018-02-11 11:12:41.801093116 +0100 +++ linux-4.14.18/net/ipv4/route.c 2018-02-14 07:27:54.290124737 +0100 @@ -295,7 +295,7 @@ seq_printf(seq,"%08x %08x %08x %08x %08x %08x %08x %08x " " %08x %08x %08x %08x %08x %08x %08x %08x %08x \n", dst_entries_get_slow(&ipv4_dst_ops), - 0, /* st->in_hit */ + st->in_hit, /* 0, / * st->in_hit */ st->in_slow_tot, st->in_slow_mc, st->in_no_route, @@ -1668,6 +1668,7 @@ { struct fib_nh_exception *fnhe; struct rtable *rth; + struct fib_nh *nh; int err; struct in_device *out_dev; bool do_cache; @@ -1713,23 +1714,110 @@ fnhe = find_exception(&FIB_RES_NH(*res), daddr); if (do_cache) { + nh = &FIB_RES_NH(*res); if (fnhe) { rth = rcu_dereference(fnhe->fnhe_rth_input); if (rth && rth->dst.expires && time_after(jiffies, rth->dst.expires)) { - ip_del_fnhe(&FIB_RES_NH(*res), daddr); + ip_del_fnhe(nh, daddr); fnhe = NULL; } else { goto rt_cache; } } - rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input); + rth = rcu_dereference(nh->nh_rth_input); rt_cache: if (rt_cache_valid(rth)) { skb_dst_set_noref(skb, &rth->dst); goto out; + } else if (rth && + rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK && + rth->dst.error == 0 && + rth->dst.dev == out_dev->dev && + rth->dst.ops == &ipv4_dst_ops && +#ifdef CONFIG_XFRM + rth->dst.xfrm == NULL && +#endif + rth->dst.flags == + ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ? + DST_NOPOLICY : 0) | + (IN_DEV_CONF_GET(out_dev, NOXFRM) ? + DST_NOXFRM : 0)) && + rth->dst.lwtstate == nh->nh_lwtstate && + rth->rt_type == res->type && + rth->rt_table_id == + (res->table ? res->table->tb_id : 0) && + rth->rt_gateway == + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + nh->nh_gw : 0) && + rth->rt_uses_gateway == + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + 1 : 0)) { + rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev)); + skb_dst_set_noref(skb, &rth->dst); + RT_CACHE_STAT_INC(in_hit); + if (rth->dst.header_len != 0) + net_warn_ratelimited("rth->dst.header_len (%x) != 0\n", rth->dst.header_len); + if (rth->dst.trailer_len != 0) + net_warn_ratelimited("rth->dst.trailer_len (%x) != 0\n", rth->dst.trailer_len); + if (rth->rt_is_input != 1) + net_warn_ratelimited("rth->rt_is_input != 1\n"); + if (rth->rt_flags != 0) + net_warn_ratelimited("rth->rt_flags (%x) != 0\n", rth->rt_flags); + if (rth->rt_iif != 0) + net_warn_ratelimited("rth->rt_iif (%x) != 0\n", rth->rt_iif); + if (rth->rt_pmtu != 0) + net_warn_ratelimited("rth->rt_pmtu (%x) != 0\n", rth->rt_pmtu); + if (rth->dst.output != ip_output) + net_warn_ratelimited("rth->dst.output != ip_output\n"); + if (rth->dst.input != ip_forward) + net_warn_ratelimited("rth->dst.input != ip_forward\n"); + goto out; + } else if (rth && + rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK) { + if (rth->dst.error != 0) + net_warn_ratelimited("rth->dst.error (%x) != 0\n", rth->dst.error); + if (rth->dst.dev != out_dev->dev) + net_warn_ratelimited("mkroute_input: rth->dst.dev != out_dev->dev\n"); + if (rth->dst.ops != &ipv4_dst_ops) + net_warn_ratelimited("mkroute_input: rth->dst.ops != &ipv4_dst_ops\n"); +#ifdef CONFIG_XFRM + if (rth->dst.xfrm != NULL) + net_warn_ratelimited("mkroute_input: rth->dst.xfrm != NULL\n"); +#endif + if (rth->dst.flags != + ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ? + DST_NOPOLICY : 0) | + (IN_DEV_CONF_GET(out_dev, NOXFRM) ? + DST_NOXFRM : 0))) + net_warn_ratelimited("rth->dst.flags (%x) != %x\n", + rth->dst.flags, + (IN_DEV_CONF_GET(in_dev, NOPOLICY) ? + DST_NOPOLICY : 0) | + (IN_DEV_CONF_GET(out_dev, NOXFRM) ? + DST_NOXFRM : 0)); + if (rth->rt_type != res->type) + net_warn_ratelimited("rth->rt_type (%x) != res->type (%x)\n", + rth->rt_type, res->type); + if (rth->rt_table_id != + (res->table ? res->table->tb_id : 0)) + net_warn_ratelimited("rth->rt_table_id (%x) != (%x)\n", + rth->rt_table_id, res->table ? res->table->tb_id : 0); + if (rth->rt_gateway != + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + nh->nh_gw : 0)) + net_warn_ratelimited("rth->rt_gateway (%x) != (%x)\n", + rth->rt_gateway, (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? nh->nh_gw : 0); + if (rth->rt_uses_gateway != + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + 1 : 0)) + net_warn_ratelimited("rth->rt_uses_gateway (%x) != (%x)\n", + rth->rt_uses_gateway, (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? 1 : 0); + if (rth->dst.lwtstate != nh->nh_lwtstate) + net_warn_ratelimited("rth->dst.lwtstate (%x) != nh->nh_lwtstate (%x)\n", + rth->dst.lwtstate, nh->nh_lwtstate); } } Regards, Anders K. Pedersen��.n��������+%������w��{.n����z�����n�r������&��z�ޗ�zf���h���~����������_��+v���)ߣ�