Fabian Hugelshofer wrote:
Patrick McHardy wrote:
Fabian Hugelshofer wrote:
Again most of the time is spent in the kernel. Memory and skb
operations are accounted there. I suspect that they cause the most
overhead.
Do you plan to dig deeper into optimising the non-optimal parts? I don't
consider myself to have enough understanding of the code to do it myself.
The first thing to try would be to use sane allocation sizes
for the event messages. This patch doesn't implement it properly
(uses probing), but should be enough to test whether it helps.
Thanks a lot. This patch already decreased the CPU usage for ctevtest
from 85% to 44%. Sweet...
Nice. Now we just need to do it properly :)
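A rough sketch of what "doing it properly" could mean: compute an upper
bound for the event message up front and hand it to nlmsg_new() instead of
probing. ctnetlink_event_size() is a made-up helper here and the attribute
list is only illustrative, not the full set ctnetlink emits:

/* Hypothetical upper-bound estimate for one conntrack event message. */
static inline size_t ctnetlink_event_size(const struct nf_conn *ct)
{
	return NLMSG_ALIGN(sizeof(struct nfgenmsg))
	       + 3 * nla_total_size(0)			/* CTA_TUPLE_ORIG/REPLY/MASTER nests */
	       + 3 * nla_total_size(sizeof(u_int32_t))	/* CTA_STATUS, CTA_TIMEOUT, CTA_ID */
	       + 64;					/* slack for proto-specific attributes */
}

	/* ... then in ctnetlink_conntrack_event(): */
	skb = nlmsg_new(ctnetlink_event_size(ct), GFP_ATOMIC);
	if (skb == NULL)
		goto errout;	/* or whatever the existing failure path is */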
I created a new callgraph profile which you find attached to this mail.
Let's have a look at two parts:
First:
2055 2.7205 ctnetlink_conntrack_event
2378 21.6201 nla_put
2181 19.8291 nfnetlink_send
2055 18.6835 ctnetlink_conntrack_event [self]
1250 11.3647 __alloc_skb
955 8.6826 ipv4_tuple_to_nlattr
752 6.8370 nf_ct_port_tuple_to_nlattr
321 2.9184 __memzero
220 2.0002 nfnetlink_has_listeners
177 1.6092 nf_ct_l4proto_find_get
155 1.4092 __nla_put
116 1.0546 nf_ct_l3proto_find_get
82 0.7455 module_put
70 0.6364 nf_ct_l4proto_put
66 0.6001 nf_ct_l3proto_put
60 0.5455 nlmsg_notify
43 0.3909 netlink_has_listeners
42 0.3819 __kmalloc
37 0.3364 kmem_cache_alloc
26 0.2364 __nf_ct_l4proto_find
13 0.1182 __irq_svc
nf_conntrack_event is now one of the first functions listed. Do you see
other ways of improving performance?
For some members doing in-place message construction instead of
copying the data might help, but I could only spot a few, and those
are used rarely.
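To illustrate the in-place idea: reserve the attribute and fill it directly
in the skb instead of building the payload in a temporary structure and
copying it with nla_put(). CTA_EXAMPLE and struct nf_ct_example are
placeholders, not real ctnetlink names:

	struct nlattr *nla;
	struct nf_ct_example *data;	/* placeholder payload type */

	nla = nla_reserve(skb, CTA_EXAMPLE, sizeof(*data));
	if (nla == NULL)
		goto nla_put_failure;
	data = nla_data(nla);
	data->field_a = htonl(value_a);	/* written straight into the message */
	data->field_b = htons(value_b);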
The module reference stuff (module_put/nf_ct_*_find_get etc.)
is clearly superfluous; this runs in packet processing context
and should use RCU rather than module references.
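As a sketch of the RCU variant (signatures from memory of this kernel
version, so they may not match exactly):

	const struct nf_conntrack_tuple *tuple;	/* tuple being dumped */
	struct nf_conntrack_l3proto *l3proto;
	struct nf_conntrack_l4proto *l4proto;

	rcu_read_lock();
	l3proto = __nf_ct_l3proto_find(tuple->src.l3num);
	l4proto = __nf_ct_l4proto_find(tuple->src.l3num, tuple->dst.protonum);
	/* ... dump the tuple attributes with l3proto/l4proto ... */
	rcu_read_unlock();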
Second:
33 2.4775 __nf_ct_ext_add
63 4.7297 dev_hard_start_xmit
65 4.8799 sock_recvmsg
77 5.7808 netif_receive_skb
92 6.9069 __nla_put
96 7.2072 nf_conntrack_alloc
199 14.9399 nf_conntrack_in
246 18.4685 skb_copy
427 32.0571 nf_ct_invert_tuplepr
1793 2.3737 __memzero
1793 100.000 __memzero [self]
Is the zeroing of the inverted tuple in nf_ct_invert_tuple really
required? As far as I can see all fields are set by the subsequent code.
It depends on the protocol family. For IPv6 it's completely
unnecessary; for IPv4 the last 12 bytes of each address need
to be zeroed. We could push this down to the protocols to
behave more optimally (actually something I started and didn't
finish some time ago).
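To sketch the idea (callback signature and field names as of roughly this
kernel version, so treat it as an approximation): the generic memset in
nf_ct_invert_tuple() would go away, IPv6 would clear nothing, and the IPv4
callback would clear only the 12 unused bytes of each 16-byte address union:

static int ipv4_invert_tuple(struct nf_conntrack_tuple *tuple,
			     const struct nf_conntrack_tuple *orig)
{
	/* Only the first 4 bytes of each address union hold an IPv4
	 * address; zero the remaining 12 bytes here instead of having
	 * the caller memset the whole inverse tuple. */
	memset(&tuple->src.u3.all[1], 0, 3 * sizeof(tuple->src.u3.all[0]));
	memset(&tuple->dst.u3.all[1], 0, 3 * sizeof(tuple->dst.u3.all[0]));

	tuple->src.u3.ip = orig->dst.u3.ip;
	tuple->dst.u3.ip = orig->src.u3.ip;
	return 1;
}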