On Thu, Jun 01, 2023 at 10:06:24PM +0200, Pablo Neira Ayuso wrote:
> On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> > Phil Sutter <phil@xxxxxx> wrote:
> > > A call to 'nft list ruleset' in a second terminal hangs without
> > > output. It apparently hangs in nft_cache_update() because
> > > rule_cache_dump() returns EINTR. On the kernel side, I guess it
> > > stems from nl_dump_check_consistent() in __nf_tables_dump_rules().
> > > I haven't checked, but the generation counter likely increases
> > > while dumping the 100k rules.
> >
> > Yes.
> >
> > > One may deem this scenario unrealistic, but I had to insert a
> > > 'sleep 5' into the while-loop to unblock 'nft list ruleset' again.
> > > A new rule every 4s, especially in such a large ruleset, is not
> > > that unrealistic IMO.
> >
> > Several seconds is very strange indeed. How big is the data that
> > needs to be transferred to userspace, and how large is the buffer
> > provided during dumps? strace would help here.
> >
> > If that's rather small, then dumping a chain with 10k rules may have
> > to re-iterate the existing list for a long time before it finds the
> > starting point where to resume the dump.
> >
> > > I wonder if we can provide some fairness to readers? Ideally a
> > > reader would just see the ruleset as it was when it started
> > > dumping, but keeping a copy of the large ruleset is probably not
> > > feasible.
> >
> > I can't think of a good solution. We could add a
> > "--allow-inconsistent-dump" flag to nftables that disables the
> > restart logic for the -EINTR case, but we can't make that the
> > default, unfortunately.
> >
> > Or we could experiment with serializing the remaining rules into a
> > private kernel-side kmalloc'd buffer once the userspace buffer is
> > full, then copy from that buffer on resume without the inconsistency
> > check.
>
> I don't think that we can solve this: slowing down writers when there
> are dumpers will lead to the same issue, just in the opposite
> direction.
>
> There are currently two pending issues that, if addressed, could
> improve things:
>
> NLM_F_INTR is set in case a writer interferes with a reader, currently
> forcing userspace to read all of the remaining messages to leave
> things in a consistent state, otherwise the next dump request hits
> EILSEQ in libmnl. Before 6d085b22a8b5 ("table: support for the table
> owner flag"), the socket was closed and reopened to work around this
> issue. There should be a way to discard the ongoing netlink dump
> without having to read the remaining messages; that should also
> improve things.

I tried restoring the immediate return from nft_mnl_recv(), adding
socket close and open calls to sanitize things. Assuming my changes are
correct, they don't have a noticeable effect: the same test case still
allows for a 4s delay in the rule add'n'delete loop to starve
'nft list ruleset'.

> It should be possible to add generation counters per object type, so
> userspace does not have to ditch everything it has in its cache, only
> what has changed. Currently the generation counter is global.

I guess the added complexity is probably not worth it. The kernel side
could be pretty simple, but user space could no longer rely upon
nft_cache::genid and would have to fetch each object's genid to check
whether the local cache is outdated. Plus, I have no idea how one would
detect that a new table was added.

Cheers, Phil
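
P.S.: For anyone following along, below is a minimal, self-contained
libmnl sketch of the dump handling discussed above (the flag is spelled
NLM_F_DUMP_INTR in the uapi headers). It is not nftables' actual
nft_mnl_recv() path; dump_cb and the error handling at the end are made
up for illustration, just to show where the EINTR surfaces and where a
caller has to choose between draining the remaining messages and
closing/reopening the socket before restarting the dump.

/* dump-intr.c: hedged sketch, not nftables' nft_mnl_recv().
 * Build: gcc dump-intr.c -lmnl
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <libmnl/libmnl.h>
#include <linux/netlink.h>
#include <linux/netfilter.h>
#include <linux/netfilter/nfnetlink.h>
#include <linux/netfilter/nf_tables.h>

static int dump_cb(const struct nlmsghdr *nlh, void *data)
{
	/* A real consumer would parse the rule attributes here. */
	return MNL_CB_OK;
}

int main(void)
{
	char buf[MNL_SOCKET_BUFFER_SIZE];
	struct mnl_socket *nl;
	struct nlmsghdr *nlh;
	struct nfgenmsg *nfg;
	unsigned int seq, portid;
	int ret = MNL_CB_OK;
	ssize_t n;

	nl = mnl_socket_open(NETLINK_NETFILTER);
	if (!nl || mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID) < 0) {
		perror("mnl_socket");
		return EXIT_FAILURE;
	}
	portid = mnl_socket_get_portid(nl);

	/* Rule dump request, roughly what the rule part of nft's cache
	 * update sends. */
	seq = time(NULL);
	nlh = mnl_nlmsg_put_header(buf);
	nlh->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | NFT_MSG_GETRULE;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	nlh->nlmsg_seq = seq;
	nfg = mnl_nlmsg_put_extra_header(nlh, sizeof(*nfg));
	nfg->nfgen_family = NFPROTO_UNSPEC;
	nfg->version = NFNETLINK_V0;

	if (mnl_socket_sendto(nl, nlh, nlh->nlmsg_len) < 0) {
		perror("mnl_socket_sendto");
		return EXIT_FAILURE;
	}

	while ((n = mnl_socket_recvfrom(nl, buf, sizeof(buf))) > 0) {
		ret = mnl_cb_run(buf, n, seq, portid, dump_cb, NULL);
		if (ret <= MNL_CB_STOP)
			break;
	}

	if (ret == MNL_CB_ERROR && errno == EINTR) {
		/* The kernel set NLM_F_DUMP_INTR on a message because a
		 * writer bumped the generation counter mid-dump.  At this
		 * point the caller must either keep reading until
		 * NLMSG_DONE to leave the socket in a usable state, or
		 * give up on the socket and reopen it, before restarting
		 * the dump.  nft restarts the whole cache dump here,
		 * which is what gets starved under constant rule churn.
		 */
		fprintf(stderr, "dump interrupted, restart needed\n");
	}

	mnl_socket_close(nl);
	return EXIT_SUCCESS;
}

Draining keeps the socket usable but means reading the rest of a
possibly huge dump; closing and reopening (the pre-6d085b22a8b5
behaviour, and what I tried above) avoids that, but as said it does not
help against the starvation itself.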