On Wed, May 08, 2024 at 12:32:38AM +0200, Pablo Neira Ayuso wrote:
> Yes, a simple reproducer would be good to have.

ack, will send one soon. How does that typically look, a shell script?

> You are focusing on the "set up / tear down thousands of GTP tunnels"
> problems in this email, correct?

A bit of context: the general interest came about from my job, working on
osmo-upf and osmo-hnbgw, but this is now shifting to a personal interest
that isn't necessarily backed by my employer.

I think the narrowest focus is this: compare the time taken to add

- the first chain,
- the 1001st chain,
- the 10001st chain.

My experience shows that each additional chain takes longer to add than
the one before. Likewise, deleting one of 10000 chains takes
significantly longer than deleting one of 1000 chains, which in turn
takes significantly longer than deleting one of 10 chains in a table.

I suspect inefficiency in the very basic handling of chains per se, not
in my particular type of chain (rewriting UDP packets). But for
reference of what kinds of chains I am using, you can look at this link:

> > https://gitea.osmocom.org/cellular-infrastructure/osmo-upf/src/commit/a21bcec358a5147deb15d156700279f52386a7d7/tests/nft-rule.vty

My aim here is the general question: how efficiently does nftables
handle large numbers of chains and rules?

- could we scale down the use of temporary memory?
- could we free temporary memory more efficiently?
- could we do the freeing maintenance work *after* returning the results?

And the most important question I am asking here is: are things like
this already known issues, or are we opening a new angle on things?

> In you scenario: Is there a nft_run_cmd_from_buffer() call for each
> new chain?

I tried batching something like 400 chain additions, but it does not
make any significant difference.

> > Then I can also retrieve the counters in batches of 100, which might
> > be more efficient and better to coordinate with concurrent tasks.
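For illustration, the batching I tried looks roughly like the sketch below: generate one batch file with a few hundred chain additions and feed it to `nft -f` as a single transaction. Table and chain names here are made up for the example; the script only writes the batch file, applying it (commented out at the end) needs root and nftables.

```shell
#!/bin/sh
# Sketch: generate a batch of 400 chain additions for one nft -f transaction.
# "perftest" and "chain-N" are hypothetical names for illustration only.
BATCH=/tmp/nft-batch.txt

{
    echo "add table inet perftest"
    i=0
    while [ "$i" -lt 400 ]; do
        echo "add chain inet perftest chain-$i"
        i=$((i + 1))
    done
} > "$BATCH"

# 1 table line + 400 chain lines
wc -l < "$BATCH"

# Apply in one transaction (needs root):
#   nft -f "$BATCH"
```

Timing repeated runs of this against rulesets that already contain 1000 vs 10000 chains is what shows the per-chain cost growing.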
> > This means, your existing approach is not batching updates?

Reconsidering, I'd rather not mix this aspect into this mail thread; my
main interest is in the quite general performance considerations above.
(This refers to another use case, sorry for the confusion. That use case
is about efficiently retrieving the counters from all chains in a
ruleset: fetching all at once is slow, so fetching in batches might
help. But let's ignore that aspect for now.)

> > - can we fix that? Is there some memory leak / unnecessary blow up
> >   happening that causes this apparent factor 1000 in effort?
>
> tests regularly run valgrind and ASAN to ensure no memleaks.

Sorry, wrong word: I meant not a memleak proper, but a sort of overkill
use of memory. Even if all temporary allocations are properly cleaned up
later, it can still be a kind of "leak" if the number of those
allocations explodes. For example, a rapid loop could maybe reuse a
single allocation instead of making one per iteration, or some data tree
need not necessarily be copied for merely read-only access... that kind
of thing in general.

~N