I have run into a busy loop in netfilter due to the testing that Ubuntu does:
=> https://bugs.launchpad.net/ubuntu/+source/iptables/+bug/1840633

This only occurs with 1.8.3; the same test didn't affect 1.6.*.
It seems to be rare: we run that test quite a few times and it only triggers in one special environment, but there it triggers 100% of the time. I don't know the exact steps to reproduce, but once in that broken mode any call to "iptables -N" hangs.

Here is a stacktrace of a new -N call into the loop (the last few lines just repeat):
=> https://paste.ubuntu.com/p/YTQStdg6vm/

Note: once in that bad state other calls hang as well, like "iptables -L"
Note: nothing suspicious in dmesg or journal
Note: not sure if it is important, but this happened on Ubuntu kernel 5.2.0-10-generic. Due to the way I need to trigger this it is hard to test other kernels until I have found an easier way to reproduce it.

I have attached gdb to one such hanging process and found it looping in:

static void __nft_build_cache(struct nft_handle *h)
{
        uint32_t genid_start, genid_stop;

retry:
        mnl_genid_get(h, &genid_start);
        fetch_chain_cache(h);
        fetch_rule_cache(h);
        h->have_cache = true;
        mnl_genid_get(h, &genid_stop);
        if (genid_start != genid_stop) {
                flush_chain_cache(h, NULL);
                goto retry;
        }
        h->nft_genid = genid_start;
}

It seems that "if (genid_start != genid_stop)" is never false; I get:

(gdb) p genid_stop
$5 = 32767
(gdb) p genid_start
$6 = 3596142128

And those values never change.

I don't know how to handle this, but we could detect in the loop that both genid_* values are not changing and fail with an error. If so, what would be the proper return value from here?

Since I don't know exactly how to trigger this scenario but can run our tests into it, let me know if I should gather some data from the test system.

This goes down mnl_genid_get -> mnl_talk, and there we see the socket calls that I have seen in strace. In mnl_talk it gets to this:

        ret = mnl_socket_recvfrom(h->nl, buf, sizeof(buf));
        while (ret > 0) {
                ret = mnl_cb_run(buf, ret, h->seq, h->portid, cb, data);
                if (ret <= 0)
                        break;

This first gets 40 back from the socket and then runs mnl_cb_run, which returns -1 and due to that breaks. That makes mnl_talk return -1 and no new generation id gets set.

IMHO, in __nft_build_cache, if mnl_genid_get returns -1 then things are broken and we cannot rely on the values it was supposed to place in "uint32_t *genid". The problem is that __nft_build_cache itself has no return value and we can't do anything "smart" about its final action:

1597            h->nft_genid = genid_start;

We could track whether the genids are "not changing", but that could in theory be a valid case. The -1 instead clearly means something went wrong.

I attached gdb to a new iptables -N run and it immediately enters this error condition:

(gdb) run -N ufw-caps-test2
Starting program: /usr/sbin/iptables -N ufw-caps-test2

Breakpoint 1, 0x000000010001e348 in __nft_build_cache (h=0x7fffffffed00) at nft.c:1582
1582    {
[...]
(gdb) p genid_start
$1 = 4294961248
(gdb) p genid_stop
[...]
1592            if (genid_start != genid_stop) {
(gdb) p genid_stop
$3 = 32767
(gdb) p genid_start
$4 = 4294961248
(gdb) n
[...]
(gdb) n
1592            if (genid_start != genid_stop) {
(gdb) p genid_start
$5 = 4294961248
(gdb) p genid_stop

So whatever happened to make the system respond that way, it is persistent.
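Coming back to the question of how to handle this: below is a rough sketch of the idea, not the attached patch. It assumes mnl_genid_get() indeed returns a negative value when mnl_talk() fails as described above, and it gives __nft_build_cache a return value so the error can be reported instead of spinning forever:

/* Rough sketch only, not the attached patch: bail out of the retry loop
 * when mnl_genid_get() reports an error instead of comparing whatever is
 * left in the (then uninitialized) genid variables. */
static int __nft_build_cache(struct nft_handle *h)
{
        uint32_t genid_start, genid_stop;

retry:
        /* If the kernel did not answer, genid_start holds stack garbage
         * (e.g. the 4294961248 seen in gdb) and must not be used. */
        if (mnl_genid_get(h, &genid_start) < 0)
                return -1;

        fetch_chain_cache(h);
        fetch_rule_cache(h);
        h->have_cache = true;

        if (mnl_genid_get(h, &genid_stop) < 0) {
                flush_chain_cache(h, NULL);
                h->have_cache = false;
                return -1;
        }

        /* Only retry when we got two valid but different generation ids,
         * i.e. the ruleset really changed while we were fetching it. */
        if (genid_start != genid_stop) {
                flush_chain_cache(h, NULL);
                goto retry;
        }

        h->nft_genid = genid_start;
        return 0;
}

Callers like nft_build_cache/nft_rebuild_cache could then propagate the failure instead of pretending a cache exists.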
There also is h->have_cache, which is a) a requirement to enter __nft_build_cache and b) set to true after cache creation. We might want/need to unset it on the error condition and have consumers check it before usage.

Users and their error conditions:

cache creation (just let them retry every time):
- nft_rebuild_cache
- nft_build_cache

cache usage:
- nftnl_table_list_get -> return NULL
- nft_chain_list_get -> return NULL

(a rough sketch of such a consumer-side check is appended below the diffstat)

The problem here is that this is just "better error handling": the error persists, and in this environment we won't be able to set up any cache, maybe not even a single chain.

I installed a version with my RFC fix applied, and with that there at least is no busy hang anymore. What I found is that formerly (with the hang in -N) no rules got created at all. Since with my fix -N actually works, I now see them added (visible in nft). But it seems that "iptables -L" tries to access the cache (which we know fails to be created) and therefore goes into an error path:

  iptables v1.8.3 (nf_tables): table `filter' is incompatible, use 'nft' tool.

This message might not be perfect (or maybe it is and I just don't realize it). But it made me check whether nft knows anything about the rules that might help. With nft I see the current rules (maybe those look suspicious to you?):

$ sudo nft list ruleset
table ip mangle {
        chain PREROUTING {
                type filter hook prerouting priority mangle; policy accept;
        }
        chain INPUT {
                type filter hook input priority mangle; policy accept;
        }
        chain FORWARD {
                type filter hook forward priority mangle; policy accept;
                meta l4proto tcp tcp flags & (syn|rst) == syn counter packets 0 bytes 0 tcp option maxseg size set rt mtu
        }
        chain OUTPUT {
                type route hook output priority mangle; policy accept;
        }
        chain POSTROUTING {
                type filter hook postrouting priority mangle; policy accept;
        }
}

And with the fix, iptables -N adds entries at the bottom like:

table ip filter {
        chain foo1 {
        }
        chain foo2 {
        }
}

This might be the "filter" that the error of "iptables -L" is referring to. Flushing all rules with nft doesn't change anything in regard to the broken situation.

I guess we need to find/understand the root cause, as my patch just makes the symptom "a bit less fatal". So far the system is still up and running and I can debug/check whatever you need.

With more experience in the area or the codebase there might be a much better solution. Due to that, this is meant as an RFC to get things going and I'm reaching out to your expertise.

Christian Ehrhardt (1):
  nft: abort cache creation if mnl_genid_get fails

 iptables/nft.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--
2.22.0
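Appended for illustration only (not part of the attached patch): a hypothetical sketch of the consumer-side have_cache check mentioned above. Apart from have_cache and nft_build_cache(), all names are invented, and I'm assuming nft_build_cache() takes only the handle:

/* Hypothetical illustration: 'chain_cache' and the helper name are
 * invented; only have_cache and nft_build_cache() come from the
 * discussion above.  The point is just that cache users return NULL
 * when the cache could not be created, instead of dereferencing a
 * cache that was never populated. */
static struct nftnl_chain_list *example_chain_list_get(struct nft_handle *h)
{
        if (!h->have_cache)
                nft_build_cache(h);

        if (!h->have_cache)     /* cache creation failed (genid fetch broken) */
                return NULL;    /* caller prints an error and exits */

        return h->chain_cache;  /* placeholder for the real cache lookup */
}

With nftnl_table_list_get/nft_chain_list_get returning NULL in that case, the command could print an error and exit instead of hanging.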