On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> Phil Sutter <phil@xxxxxx> wrote:
> > A call to 'nft list ruleset' in a second terminal hangs without output.
> > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > returns EINTR. On kernel side, I guess it stems from
> > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > checked, but the generation counter likely increases while dumping the
> > 100k rules.
>
> Yes.
>
> > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > every 4s especially in such a large ruleset is not that unrealistic IMO.
>
> Several seconds is very strange indeed, how large is the data that needs
> to be transferred to userspace and how large is the buffer provided
> during dumps? strace would help here.

Each recvmsg() call returns 32KB, grepping for NFT_MSG_NEWRULE returns
4290 lines.

| # time ./src/nft list ruleset | wc -l
| # Warning: table ip filter is managed by iptables-nft, do not touch!
| # Warning: table ip nat is managed by iptables-nft, do not touch!
| # Warning: table ip mangle is managed by iptables-nft, do not touch!
| 100190
|
| real 0m5.572s
| user 0m1.014s
| sys 0m4.885s

> If that's rather small, then dumping a chain with 10k rules may
> have to re-iterate the existing list for a long time before it finds
> the starting point on where to resume the dump.

To my surprise, the mnl_nft_rule_dump() code-path does not call
mnl_set_rcvbuffer(). Though explicitly calling it from nft_mnl_talk()
passing 1<<24 as buffer size does not lead to different behaviour. I
seem to recall the 32k was a kernel-side limit in netlink?

> > I wonder if we can provide some fairness to readers? Ideally a reader
> > would just see the ruleset as it was when it started dumping, but
> > keeping a copy of the large ruleset is probably not feasible.
>
> I can't think of a good solution. We could add a
> "--allow-inconsistent-dump" flag to nftables that disables the restart
> logic for -EINTR case, but we can't make that the default unfortunately.
>
> Or we could experiment with serializing the remaining rules into a
> private kernel-side kmalloc'd buffer once the userspace buffer is
> full, then copy from that buffer on resume without the inconsistency check.
>
> I don't think that we can solve this, slowing down writers when there
> are dumpers will lead to the same issue, just in the opposite direction.

You're probably right, thanks for spending cycles on it.

Cheers, Phil
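
For illustration only, here is a toy model of the restart-on-EINTR
behaviour discussed above. It is not nftables or kernel code. The
function name, its parameters and the virtual-time model are all
hypothetical. It assumes an uninterrupted dump takes a fixed time (the
5572 ms below is borrowed from the 'time' output above, assuming that
run was not itself restarted) and that a writer bumps the generation
counter at fixed intervals. Any bump during a dump forces the reader to
start over.

```python
def dump_attempts(dump_ms, write_interval_ms, start_ms=0, max_attempts=1000):
    """Count the attempts a reader needs to finish one consistent dump.

    Hypothetical model: the writer bumps the generation counter at
    t = write_interval_ms, 2 * write_interval_ms, ... (virtual milliseconds).
    A dump running from t to t + dump_ms is interrupted if a bump falls
    strictly inside that window; the reader then restarts right at the bump.
    Returns None if the dump never completes within max_attempts.
    """
    t = start_ms
    for attempt in range(1, max_attempts + 1):
        # First generation bump strictly after the current start time.
        next_bump = (t // write_interval_ms + 1) * write_interval_ms
        if t + dump_ms <= next_bump:
            return attempt  # finished before the generation changed
        t = next_bump       # "EINTR": restart the dump from scratch
    return None             # reader is starved by the writer


if __name__ == "__main__":
    print(dump_attempts(5572, 4000))                 # writer faster than dump
    print(dump_attempts(5572, 9000))                 # writer slower than dump
    print(dump_attempts(5572, 9000, start_ms=5000))  # unlucky start, one retry
```

In this model the reader can only make progress when the gap between
two writes is at least as long as one full dump, which matches the
observation that only slowing down the writer unblocked 'nft list
ruleset'.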