Hi!

I am trying to figure out why adding/deleting chains and retrieving counters via
nftables takes so long once a high number of entries is involved. I'd like to
share some findings, and to ask whether this is known, whether a solution
exists, etc.

I'm writing two programs (osmo-upf and osmo-hnbgw) that interact with nftables
via nft_run_cmd_from_buffer(). One use case is setting up / tearing down
thousands of GTP tunnels by submitting nftables rule sets one by one, for
reference:
https://gitea.osmocom.org/cellular-infrastructure/osmo-upf/src/commit/a21bcec358a5147deb15d156700279f52386a7d7/tests/nft-rule.vty
The other use case is retrieving all counters that are currently present in the
inet table that my client program owns. (A minimal sketch of the call pattern I
use is appended at the end of this mail, before the bpftrace dump.)

Adding the first few hundred chains / reading the first few hundred counters
finishes in an acceptable amount of time. But the more chains are already
present in the table, the longer it takes to add another one, and so on. By the
time there are ~4000 chains, adding another chain becomes prohibitively slow.

Some numbers: setting up 1000 GTP tunnels in osmo-upf, which means adding 4000
chains to a table owned by the program, takes 49 seconds in total. This includes
the overhead of talking PFCP, logging, etc., but by far the most time is spent
waiting for nft_run_cmd_from_buffer(). We'd like to scale up to about 100 times
that, which is completely out of reach right now, since the wait time appears to
grow much faster than linearly with the number of chains already present.

I have two angles:

1) workaround: structure the chains and rules differently?
2) analysis: bpftrace tells me that most time is spent in chain_free().

(1) Currently I have one flat table with all the chains at the top level. Would
it make sense to break that long list up into smaller groups, maybe batches of
100? In pseudocode:

  table osmo-upf {
    group-0 {
      chain-0 {}
      chain-1 {}
      ...
    }
    group-1 {
      chain-100 {}
      chain-101 {}
      ...
    }
    ...
    group-1000 {
      chain-100000 {}
      chain-100001 {}
      ...
    }
  }

Then I could also retrieve the counters in batches of 100, which might be more
efficient and easier to coordinate with concurrent tasks. (A rough sketch of one
possible realization is also appended below.)

(2) Using bpftrace, I drilled a bit into where the time is spent. Two
interesting findings for me so far. It seems most time is spent under

  nft_run_cmd_from_buffer
    nft_evaluate
      nft_cache_update
        nft_cache_release
          nft_cache_flush
            table_free
              chain_free

The bpftrace shows:

- nft_cache_update() was called 6000 times. (I have 1000 GTP tunnels and expect
  four chains per GTP tunnel, which would be 4000; not sure why I see 6k, but
  that's not really significant.)

- chain_free() was called more than 5 million times, i.e. about 1000 times as
  often as I would naively expect. Most calls are fast (<16us), but there are
  outliers of up to 8ms, and in total the 5 million calls amount to 80 seconds.
  (One bpftrace dump is included below, FYI.)

I'm continuing to think about:

- Is this known already, or does it make sense to continue on this path, share a
  reproduction recipe here, and so on?

- Can this be fixed? Is there some memory leak / unnecessary blow-up happening
  that causes this apparent factor-1000 in effort?

- Can the freeing be deferred, so that my client program gets the nftables
  results without having to wait for the free? Could the client program decide
  when the nftables freeing happens, i.e. outside of nft_run_cmd_from_buffer()?

Thanks!
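
For reference, this is roughly the shape of the call pattern I use against
libnftables, reduced to a minimal sketch; the actual osmo-upf code and rule
strings are more involved, and the table/chain names here are made up:

#include <stdio.h>
#include <nftables/libnftables.h>

int main(void)
{
	struct nft_ctx *nft = nft_ctx_new(NFT_CTX_DEFAULT);
	if (!nft)
		return 1;

	/* collect 'list' output in a buffer instead of printing to stdout */
	nft_ctx_buffer_output(nft);

	/* per GTP tunnel, one such buffer gets submitted (in reality four
	 * chains plus their rules per tunnel) */
	if (nft_run_cmd_from_buffer(nft,
				    "add table inet osmo-upf\n"
				    "add chain inet osmo-upf tunmap-pre-123\n"
				    "add chain inet osmo-upf tunmap-post-123\n") < 0)
		fprintf(stderr, "nft command failed\n");

	/* periodically: fetch all counters of the table this program owns */
	if (nft_run_cmd_from_buffer(nft, "list counters table inet osmo-upf") == 0)
		printf("%s", nft_ctx_get_output_buffer(nft));

	nft_ctx_free(nft);
	return 0;
}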
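
Regarding the grouping idea in (1): since chains cannot contain other chains in
nftables, I imagine the "groups" would have to become e.g. separate tables.
A rough, hypothetical sketch of what I mean, again with made-up names, so that
counters can then be listed per group:

#include <stdio.h>
#include <nftables/libnftables.h>

#define CHAINS_PER_GROUP 100

/* Create one group table together with its batch of chains, all submitted in
 * a single nft_run_cmd_from_buffer() call. */
static int add_group(struct nft_ctx *nft, int group)
{
	char buf[8192]; /* large enough for 100 short 'add chain' lines */
	int off = 0;

	off += snprintf(buf + off, sizeof(buf) - off,
			"add table inet osmo-upf-%d\n", group);
	for (int i = 0; i < CHAINS_PER_GROUP; i++)
		off += snprintf(buf + off, sizeof(buf) - off,
				"add chain inet osmo-upf-%d chain-%d\n",
				group, group * CHAINS_PER_GROUP + i);

	return nft_run_cmd_from_buffer(nft, buf);
}

/* Retrieve the counters of a single group instead of the whole ruleset. */
static int list_group_counters(struct nft_ctx *nft, int group)
{
	char cmd[64];

	snprintf(cmd, sizeof(cmd), "list counters table inet osmo-upf-%d", group);
	return nft_run_cmd_from_buffer(nft, cmd);
}

int main(void)
{
	struct nft_ctx *nft = nft_ctx_new(NFT_CTX_DEFAULT);
	if (!nft)
		return 1;
	nft_ctx_buffer_output(nft);

	if (add_group(nft, 0) == 0 && list_group_counters(nft, 0) == 0)
		printf("%s", nft_ctx_get_output_buffer(nft));

	nft_ctx_free(nft);
	return 0;
}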
foo_bar__count:    function foo_bar() was called this many times
foo_bar__total_ms: total milliseconds spent in foo_bar()
foo_bar__us:       histogram of per-call duration in microseconds

==========================================================

@handle_free__count: 5190516
@handle_free__total_ms: 16919     <--- in total 17 seconds spent in handle_free()
@handle_free__us:
[2, 4)      4884915 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)       280031 |@@                                                  |
[8, 16)       15323 |                                                    |
[16, 32)       9910 |                                                    |
[32, 64)        267 |                                                    |
[64, 128)        24 |                                                    |
[128, 256)        3 |                                                    |
[256, 512)        3 |                                                    |
[512, 1K)         0 |                                                    |
[1K, 2K)          3 |                                                    |
[2K, 4K)         48 |                                                    |

@cache_free__count: 11100
@cache_free__total_ms: 230
@cache_free__us:
[2, 4)         7327 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)         3461 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[8, 16)         177 |@                                                   |
[16, 32)         73 |                                                    |
[32, 64)          5 |                                                    |
[64, 128)         0 |                                                    |
[128, 256)        0 |                                                    |
[256, 512)        0 |                                                    |
[512, 1K)         0 |                                                    |
[1K, 2K)          9 |                                                    |
[2K, 4K)         39 |                                                    |
[4K, 8K)          9 |                                                    |

@scope_release__count: 5140608
@scope_release__total_ms: 12679   <--- in total 13 seconds spent in scope_release()
@scope_release__us:
[1]         1580615 |@@@@@@@@@@@@@@@@@@@@@@@                             |
[2, 4)      3527447 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)        21717 |                                                    |
[8, 16)        4521 |                                                    |
[16, 32)       6165 |                                                    |
[32, 64)        140 |                                                    |
[64, 128)        10 |                                                    |
[128, 256)        0 |                                                    |
[256, 512)        5 |                                                    |

@set_free__count: 16650
@set_free__total_ms: 118
@set_free__us:
[1]            2527 |@@@@@@@@@@@@@@@@@@@                                 |
[2, 4)         6643 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4, 8)         1843 |@@@@@@@@@@@@@@                                      |
[8, 16)        3475 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[16, 32)       2119 |@@@@@@@@@@@@@@@@                                    |
[32, 64)         37 |                                                    |
[64, 128)         3 |                                                    |
[128, 256)        0 |                                                    |
[256, 512)        0 |                                                    |
[512, 1K)         0 |                                                    |
[1K, 2K)          3 |                                                    |

@chain_free__count: 5137887       <--- called over 5 million times
@chain_free__total_ms: 80114      <--- in total 80 (!) seconds spent in chain_free()
@chain_free__us:
[8, 16)     4556941 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)     550786 |@@@@@@                                              |
[32, 64)      29604 |                                                    |
[64, 128)       397 |                                                    |
[128, 256)       12 |                                                    |
[256, 512)       21 |                                                    |
[512, 1K)         0 |                                                    |
[1K, 2K)          3 |                                                    |
[2K, 4K)        117 |                                                    |
[4K, 8K)         15 |                                                    |

==========================================================

-- 
- Neels Hofmeyr <nhofmeyr@xxxxxxxxxxx>          https://www.sysmocom.de/
=======================================================================
* sysmocom - systems for mobile communications GmbH
* Siemensstr. 26a
* 10551 Berlin, Germany
* Sitz / Registered office: Berlin, HRB 134158 B
* Geschaeftsfuehrer / Managing Director: Harald Welte