On Mon, Oct 16, 2017 at 7:42 AM, Florian Westphal <fw@xxxxxxxxx> wrote:
> During NFWS we briefly discussed iptables '-m connlimit' and how
> to apply this to nftables.
>
> There is also a use case to make nf_conntrack_max more fine-grained
> by making this setting apply per conntrack zone.
>
> I'd like to ideally resolve both with a single solution.

Thanks for the proposal. It would be great to be able to resolve the
two issues at the same time.

> From nft (user front end) this would look like this:
>
> add rule inet filter input ct state new ct count { ip saddr } gt 100 drop
>
> The part enclosed in { } denotes the key/grouping that should be applied.
>
> e.g.
>
> ct count { ip saddr & 255.255.255.0 } # count by source network
> ct count { ip saddr & 255.255.255.0 . tcp dport } # count by source network and service
> ct count { ct zone } # count by zone

Does this mean that all ct zones would share the same limit? If so,
should we extend this syntax to support a different limit per zone?

How would we use nft to get the ct limit and the current ct count for
a particular key? That may be useful for debugging and monitoring
purposes.

Aside from using nft to configure the ct limit, would there be an API
for other user space utilities that want to set/get the ct limit or
query the current ct count?

> For this to work there are several issues that need to be resolved.
>
> 1. xt_connlimit must be split into an iptables part and a 'nf_connlimit'
> backend.
>
> nf_connlimit.c would implement the main function:
>
> unsigned int nf_connlimit_count(struct nf_connlimit_data *,
>                                 const struct nf_conn *conn,
>                                 const void *key,
>                                 u16 keylen);
>
> Where 'nf_connlimit_data' is a structure that contains the (internal)
> bookkeeping structure(s), conn is the connection that is to be counted,
> and key/keylen is the (arbitrary) identifier to be used for grouping
> connections.
>
> xt_connlimits match function would then build a 'u32 key[5]' based on
> the options the user provided on the iptables command line, i.e.
> the conntrack zone and then either a source or destination network
> (or address).

It is not clear to me why we have 'u32 key[5]'. Is it because the key
would be the combination of an IPv4/IPv6 source or destination address
with mask (up to 16 bytes), a source or destination port (2 bytes),
and the ct zone (2 bytes)? Do we need to specify the order of the
fields in the key? How do we distinguish the case where the key is a
zone (u16) from the case where it is a dport (u16)?

> 3. Other users, such as ovs, could also call this api, in the case of
> per-zone limiting key would simply be a u16 containing the zone identifier.

I am wondering whether there will be in-kernel APIs to set and get the
conntrack limit for a particular key, and also to query the current ct
count for a particular key.

> Right now, connlimit performs garbage collection from the packet path.
> This isn't a big deal now, as we limit based by single ip address or network
> only, the limit will be small.
>
> bookkeeping looks like this:
>
>          Node
>         /    \
>      Node    Node -> conn1 -> conn2 -> conn3
>      /
>   Node
>
> Each node contains a single-linked list of connections that share the
> same source (address/network).
>
> When searching for the Node, all Nodes traversed get garbage-collected,
> i.e. connlimit walks the hlist attached to the node and removes any tuple
> no longer stored in the main conntrack table. If the node then has
> empty list, it gets erased from the tree.

Thanks for the explanation. It saved me quite some time in
understanding the code structure there.
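Just to check that I read it correctly, here is a rough standalone
sketch of the node-plus-list bookkeeping and the packet-path GC walk
as I understand it (plain C with invented names and types; the real
code of course uses the kernel's rbtree/hlist and conntrack lookup
helpers, not these stand-ins):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Invented stand-ins for the real conntrack tuple and lookup helper. */
struct tuple {
	uint32_t addr;
	uint16_t port;
	uint16_t zone;
};

static bool still_in_conntrack_table(const struct tuple *t)
{
	(void)t;
	return true; /* placeholder: would look the tuple up in the main table */
}

/* One tracked connection hanging off a node. */
struct conn_entry {
	struct tuple tuple;
	struct conn_entry *next;
};

/* Tree node: one per source address/network, with its connection list. */
struct connlimit_node {
	uint32_t key;                 /* e.g. masked source address */
	struct conn_entry *conns;     /* singly-linked list of connections */
	struct connlimit_node *left, *right;
};

/*
 * Packet-path GC for one node: walk the list, unlink entries whose
 * tuple is no longer in the main conntrack table, and return the live
 * count. If this returns 0, the caller erases the node from the tree.
 */
static unsigned int gc_and_count(struct connlimit_node *node)
{
	struct conn_entry **pprev = &node->conns;
	unsigned int count = 0;

	while (*pprev) {
		struct conn_entry *e = *pprev;

		if (!still_in_conntrack_table(&e->tuple)) {
			*pprev = e->next;
			free(e);
			continue;
		}
		pprev = &e->next;
		count++;
	}
	return count;
}

With a 10k-entry list per node, this per-node walk is exactly the part
that becomes too expensive on the packet path, which is what the
optimizations below are about.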
> But if we limit by zone then its entirely reasonable to have a limit
> of e.g. 10k per zone, i.e. each Node could have a list consisting
> of 10k elements. Walking a 10k list is not acceptable from packet path.
>
> Instead, I propose to store a timestamp of last gc in each node, so
> we can skip intermediate nodes that had a very recent gc run.
>
> If we find the node (zone, source network, etc. etc) that we should
> count the new connection for, then do an on-demand garbage collection.
>
> This is unfortunately required, we don't want to say '150' if the limit
> is 150 then the new connection would be dropped, unless we really still
> have 150 active connections.
>
> To resolve this, i suggest to store the count in the node itself
> (so we do not need to walk the full list of individual connections in the
> packet path).
>
> The hlist would be replaced with another tree:
>
> This allows us to quickly find out if the tuple we want to count now
> is already stored (e.g. because of address/port reuse) or not.
>
> It also permits us to check all tuples we see while searching for
> the 'new' tuple that should be stored in the subtree and remove those that
> are no longer in the main conntrack table.
> We could also abort a packet-path gc run if we found at least one old
> connection.

I agree that it is inefficient to run GC when a tree node hosts, say,
10k connections in its hlist. I think all the optimization mechanisms
mentioned above are useful: storing a 'timestamp' and a 'count' in the
tree node, aborting the gc run once we find at least one old
connection, and replacing the hlist with a tree. Basically, the first
two techniques reduce unnecessary GC on the packet path, and replacing
the hlist with an rb_tree further reduces the insertion time into the
tree node from O(n) to O(log n), where n is the number of connections
in the hlist.

Regarding the DESTROY event approach you mentioned: I am wondering
whether we could store an extra pointer to the tree node, maybe in
struct nf_ct_ext, so that we can associate the nf_conn with its tree
node, increase the count when the connection is confirmed, and
decrease the count when the connection is destroyed. In this way, with
some extra space, inserting a connection and updating the count in the
tree node becomes O(1), and we can get rid of the GC scheme. I am not
sure how much effort it would take to modify the conntrack core,
though.

Thanks,

-Yi-Hung
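P.S. To make the confirm/destroy idea above a bit more concrete, here
is a minimal hypothetical sketch (plain C, invented names, no locking
or refcounting, and not the real nf_ct_ext interface) of how the count
could be kept up to date without any packet-path GC:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-key accounting node (one per zone/source/... key). */
struct limit_node {
	uint32_t count;  /* confirmed connections currently charged to this key */
	uint32_t limit;  /* configured maximum for this key */
	/* rb-tree linkage, key storage, locking, refcount omitted */
};

/* Hypothetical per-connection extension, e.g. attached via nf_ct_ext. */
struct limit_ext {
	struct limit_node *node; /* back-pointer set when the conn is confirmed */
};

/* Called when a new connection is confirmed; returns 0 if it is allowed. */
static int limit_on_confirm(struct limit_ext *ext, struct limit_node *node)
{
	if (node->count >= node->limit)
		return -1;       /* over the per-key limit: reject/drop */
	node->count++;           /* O(1): no list or subtree walk */
	ext->node = node;
	return 0;
}

/* Called from the conntrack destroy path (e.g. driven by DESTROY events). */
static void limit_on_destroy(struct limit_ext *ext)
{
	if (ext->node) {
		ext->node->count--;  /* O(1) decrement, so no GC is needed */
		ext->node = NULL;
	}
}

The cost would be one extra pointer per tracked connection, and the
tree node would need its own reference counting (or similar) so that
it cannot go away while connections still point at it.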