"Kernel bug detected [...] nf_ct_del_from_dying_or_unconfirmed_list"

Hi,

I was trying to implement a multicast-to-multi-unicast conversion
in batman-adv with the following patch:

https://patchwork.open-mesh.org/patch/17729/

However, on OpenWrt with a 4.9.146 kernel I get a
"Kernel bug detected [...] nf_ct_del_from_dying_or_unconfirmed_list".

This only happens upon sending a SIGTERM to the network manager
"netifd" (that is, upon network shutdown), and only if the node is
connected to a mesh of reasonable size, so that there is a certain
amount of multicast traffic for the multicast-to-multi-unicast patch
to work on.

During normal operation, no such crash seems to occur.

The crash itself is triggered by the:

  BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));

in here:

https://elixir.bootlin.com/linux/v4.9.146/source/net/netfilter/nf_conntrack_core.c#L354
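
To make the failing check concrete, here is a minimal user-space sketch
of what hlist_nulls_unhashed() tests: a node counts as "unhashed" when
its pprev back-pointer is NULL, i.e. it is not linked into any list.
The struct and function names below (model_node, model_head,
model_checked_del, ...) are hypothetical stand-ins, not kernel API, and
the real nulls-list details (nulls end markers, RCU, poisoning) are
left out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct hlist_nulls_node. */
struct model_node {
	struct model_node *next;
	struct model_node **pprev;	/* NULL: not linked into any list */
};

/* Hypothetical stand-in for a per-CPU dying/unconfirmed list head. */
struct model_head {
	struct model_node *first;
};

/* Same idea as hlist_nulls_unhashed(): unhashed means pprev is NULL. */
bool model_unhashed(const struct model_node *n)
{
	return n->pprev == NULL;
}

/* Link a node at the head of the list. */
void model_add_head(struct model_node *n, struct model_head *h)
{
	n->next = h->first;
	n->pprev = &h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
}

/*
 * Models the guarded delete: returns false where the kernel would hit
 * the BUG_ON(), otherwise unlinks the node and returns true.
 * Assumption: this sketch leaves n's own pointers untouched after
 * unlinking and makes no claim about what the kernel stores there.
 */
bool model_checked_del(struct model_node *n)
{
	if (model_unhashed(n))
		return false;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	return true;
}
```

In this model, calling model_checked_del() on a node that was never
passed to model_add_head() returns false. That corresponds to the
situation the BUG_ON() reports: a conntrack entry is being removed from
the dying/unconfirmed list although it is not linked into it.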


What confuses me a bit is that the multicast-to-multi-unicast
conversion uses the same simple skb_copy() approach as the
"classic broadcast flooding" in batman-adv so far. The latter also
transmits three redundant frames via skb_copy() to increase
reliability for Wifi broadcast packets.

One difference is that the broadcast flooding adds a bit of
delay between each transmission, which the multicast-to-multi-unicast
conversion doesn't.

Looking at "git log net/netfilter/nf_conntrack_core.c" I noticed
"netfilter: nfnetlink_queue: resolve clash for unconfirmed
conntracks" (368982cd7), which says:

"In nfqueue, two consecutive skbuffs may race to create the conntrack
 entry. Hence, the one that loses the race gets dropped due to clash in
 the insertion into the hashes from the nf_conntrack_confirm() path."

This patch is only part of >= 4.18, so it is not part of the firmware
we use yet. Could this issue somehow be related?


Other than that, I was wondering whether we might be forgetting to
reset something after skb_copy()-ing. We do a "skb->protocol =
htons(ETH_P_BATMAN)" right before the dev_queue_xmit(skb) call in
batman-adv, which sends the encapsulated frame into the
mesh. And we do a nf_reset(skb) after decapsulating a frame
received from the mesh. But maybe that is not enough?
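
As a sketch of why a missing reset could matter: to my understanding of
the 4.9 sources (an assumption worth double-checking), the header copy
done as part of skb_copy() makes the new skb point at the same
conntrack entry as the original and takes an extra reference, and
nf_reset() drops that reference and clears the pointer. The user-space
model below only illustrates this sharing. ct_model, skb_model and the
two functions are hypothetical names, not kernel API, and the sketch
does not claim that this is the cause of the crash.

```c
#include <stddef.h>

/* Hypothetical stand-in for a conntrack entry with a reference count. */
struct ct_model {
	int refcount;
};

/* Hypothetical stand-in for an skb that may reference a conntrack entry. */
struct skb_model {
	struct ct_model *nfct;
};

/*
 * Models the conntrack part of the skb header copy: the copy shares
 * the same entry and takes its own reference.
 */
void skb_model_copy_header(struct skb_model *dst, const struct skb_model *src)
{
	dst->nfct = src->nfct;
	if (dst->nfct)
		dst->nfct->refcount++;
}

/*
 * Models nf_reset(): drop the reference and clear the pointer.
 * Returns 1 if this was the last reference (where the entry would be
 * destroyed), 0 otherwise.
 */
int skb_model_nf_reset(struct skb_model *skb)
{
	int last = 0;

	if (skb->nfct) {
		last = (--skb->nfct->refcount == 0);
		skb->nfct = NULL;
	}
	return last;
}
```

In this model, every copy that is sent out without a reset keeps the
original's conntrack entry referenced. Whether batman-adv would need a
reset on the transmit path too is exactly the open question here.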

Ticket where this issue was reported:

https://github.com/freifunk-gluon/gluon/issues/1468

Regards, Linus


