Re: nftables: Writers starve readers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jun 02, 2023 at 02:23:53PM +0200, Phil Sutter wrote:
> On Thu, Jun 01, 2023 at 10:06:24PM +0200, Pablo Neira Ayuso wrote:
> > On Thu, Jun 01, 2023 at 05:11:05PM +0200, Florian Westphal wrote:
> > > Phil Sutter <phil@xxxxxx> wrote:
> > > > A call to 'nft list ruleset' in a second terminal hangs without output.
> > > > It apparently hangs in nft_cache_update() because rule_cache_dump()
> > > > returns EINTR. On kernel side, I guess it stems from
> > > > nl_dump_check_consistent() in __nf_tables_dump_rules(). I haven't
> > > > checked, but the generation counter likely increases while dumping the
> > > > 100k rules.
> > > 
> > > Yes.
> > > 
> > > > One may deem this scenario unrealistic, but I had to insert a 'sleep 5'
> > > > into the while-loop to unblock 'nft list ruleset' again. A new rule
> > > > every 4s especially in such a large ruleset is not that unrealistic IMO.
> > > 
> > > Several seconds is very strange indeed, how is the data that needs
> > > to be transferred to userspace and how large is the buffer provided
> > > during dumps? strace would help here.
> > > 
> > > If thats rather small, then dumping a chain with 10k rules may
> > > have to re-iterate the existig list for long time before it finds
> > > the starting point on where to resume the dump.
> > > 
> > > > I wonder if we can provide some fairness to readers? Ideally a reader
> > > > would just see the ruleset as it was when it started dumping, but
> > > > keeping a copy of the large ruleset is probably not feasible.
> > > 
> > > I can't think of a good solution.  We could add a
> > > "--allow-inconsistent-dump" flag to nftables that disables the restart
> > > logic for -EINTR case, but we can't make that the default unfortunately.
> > > 
> > > Or we could experiment with serializing the remaining rules into a
> > > private kernel-side kmalloc'd buffer once the userspace buffer is
> > > full, then copy from that buffer on resume without the inconsistency check.
> > > 
> > > I don't think that we can solve this, slowing down writers when there
> > > are dumpers will load to the same issue, just in the oppostite direction.
> > 
> > There are currently two pending issues that, if addressed, could
> > improve things:
> > 
> > NLM_F_INTR is set on in case writer infers with a reader, currently
> > forcing userspace to read all of the remaining messages to leave
> > things in consistent state, otherwise next dump request hits EILSEQ in
> > libmnl. Before 6d085b22a8b5 ("table: support for the table owner
> > flag"), the socket was closed and reopen to workaround this issue.
> > There should be a way to discard the ongoing netlink dump without
> > having to read the remaining messages, that should also improve
> > things.
> 
> I tried restoring the immediate return from nft_mnl_recv() adding socket
> close and open calls to sanitize things. Assuming my changes are
> correct, they don't have a noticeable effect: The same test-case still
> allows for a 4s delay in the rule add'n'delete loop to starve 'nft list
> ruleset'.

Not for this particular torture test, but for some other usercases, it
might be useful.

> > It should be possible to add generation counters per object type, so
> > userspace does not have to ditch all what it has in its cache, only
> > what it has changed. Currently the generation counter is global.
> 
> I guess the added complexity is probably not worth it. Kernel-side could
> be pretty simple, but user space could no longer rely upon
> nft_cache::genid but had to fetch each object's genid to check if the
> local cache is outdated, plus I have no idea how one would detect that a
> new table was added.

I made a patch to add a fall back, it displays a warning:

# Warning: ruleset has been updated while listing

it falls back to this mode if it takes more than 5 seconds to fetch a
consistent ruleset.

I think I still need a new function to make consistency check, in case
rule refering to unexisting chain, ie. top-level object in the
hierarchy is missing, to avoid crashes. Such code would only exercised
in case this fallback mode is enabled.

See attachment.
diff --git a/include/cache.h b/include/cache.h
index 934c3a74fa95..c5c585bacaf7 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -136,6 +136,7 @@ struct nft_cache {
 	struct cache		table_cache;
 	uint32_t		seqnum;
 	uint32_t		flags;
+	bool			eintr;
 };
 
 void nft_chain_cache_update(struct netlink_ctx *ctx, struct table *table,
diff --git a/include/netlink.h b/include/netlink.h
index d52434c72bc2..d3e828f270e1 100644
--- a/include/netlink.h
+++ b/include/netlink.h
@@ -85,6 +85,7 @@ struct netlink_ctx {
 	uint32_t		seqnum;
 	struct nftnl_batch	*batch;
 	int			maybe_emsgsize;
+	bool			ignore_eintr;
 };
 
 extern struct nftnl_expr *alloc_nft_expr(const char *name);
diff --git a/src/cache.c b/src/cache.c
index 95adee7f8ac1..9237397d2d3e 100644
--- a/src/cache.c
+++ b/src/cache.c
@@ -15,6 +15,7 @@
 #include <netlink.h>
 #include <mnl.h>
 #include <libnftnl/chain.h>
+#include <sys/time.h>
 #include <linux/netfilter.h>
 #include <linux/netfilter/nf_tables.h>
 
@@ -1175,10 +1176,14 @@ bool nft_cache_needs_update(struct nft_cache *cache)
 	return cache->flags & NFT_CACHE_UPDATE;
 }
 
+/* display inconsistent ruleset after 5 seconds of retries. */
+#define MAX_RETRY_TIME 5
+
 int nft_cache_update(struct nft_ctx *nft, unsigned int flags,
 		     struct list_head *msgs,
 		     const struct nft_cache_filter *filter)
 {
+	struct timeval tv_start, tv_last, tv_diff;
 	struct netlink_ctx ctx = {
 		.list		= LIST_HEAD_INIT(ctx.list),
 		.nft		= nft,
@@ -1187,7 +1192,14 @@ int nft_cache_update(struct nft_ctx *nft, unsigned int flags,
 	struct nft_cache *cache = &nft->cache;
 	uint32_t genid, genid_stop, oldflags;
 	int ret;
+
+	gettimeofday(&tv_start, NULL);
 replay:
+	gettimeofday(&tv_last, NULL);
+	timersub(&tv_diff, &tv_last, &tv_start);
+	if (tv_diff.tv_sec > MAX_RETRY_TIME)
+		ctx.ignore_eintr = true;
+
 	ctx.seqnum = cache->seqnum++;
 	genid = mnl_genid_get(&ctx);
 	if (!nft_cache_needs_refresh(cache) &&
@@ -1223,12 +1235,16 @@ replay:
 
 	genid_stop = mnl_genid_get(&ctx);
 	if (genid != genid_stop) {
-		nft_cache_release(cache);
-		goto replay;
+		if (!ctx.ignore_eintr) {
+			nft_cache_release(cache);
+			goto replay;
+		}
+		cache->eintr = true;
 	}
 skip:
 	cache->genid = genid;
 	cache->flags = flags;
+
 	return 0;
 }
 
diff --git a/src/mnl.c b/src/mnl.c
index 91775c41b246..f985e90c5926 100644
--- a/src/mnl.c
+++ b/src/mnl.c
@@ -706,8 +706,13 @@ struct nftnl_rule_list *mnl_nft_rule_dump(struct netlink_ctx *ctx, int family,
 	}
 
 	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, rule_cb, nlr_list);
-	if (ret < 0)
-		goto err;
+	if (ret < 0) {
+		if (errno != EINTR)
+			goto err;
+
+		if (!ctx->ignore_eintr)
+			goto err;
+	}
 
 	return nlr_list;
 err:
@@ -1029,8 +1034,13 @@ struct nftnl_chain_list *mnl_nft_chain_dump(struct netlink_ctx *ctx,
 	}
 
 	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, chain_cb, nlc_list);
-	if (ret < 0 && errno != ENOENT)
-		goto err;
+	if (ret < 0 && errno != ENOENT) {
+		if (errno != EINTR)
+			goto err;
+
+		if (!ctx->ignore_eintr)
+			goto err;
+	}
 
 	return nlc_list;
 err:
@@ -1174,8 +1184,13 @@ struct nftnl_table_list *mnl_nft_table_dump(struct netlink_ctx *ctx,
 	}
 
 	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, table_cb, nlt_list);
-	if (ret < 0 && errno != ENOENT)
-		goto err;
+	if (ret < 0 && errno != ENOENT) {
+		if (errno != EINTR)
+			goto err;
+
+		if (!ctx->ignore_eintr)
+			goto err;
+	}
 
 	return nlt_list;
 err:
@@ -1428,8 +1443,13 @@ mnl_nft_set_dump(struct netlink_ctx *ctx, int family,
 		memory_allocation_error();
 
 	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, set_cb, nls_list);
-	if (ret < 0 && errno != ENOENT)
-		goto err;
+	if (ret < 0 && errno != ENOENT) {
+		if (errno != EINTR)
+			goto err;
+
+		if (!ctx->ignore_eintr)
+			goto err;
+	}
 
 	return nls_list;
 err:
@@ -1653,8 +1673,13 @@ mnl_nft_obj_dump(struct netlink_ctx *ctx, int family,
 		memory_allocation_error();
 
 	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, obj_cb, nln_list);
-	if (ret < 0)
-		goto err;
+	if (ret < 0) {
+		if (errno != EINTR)
+			goto err;
+
+		if (!ctx->ignore_eintr)
+			goto err;
+	}
 
 	return nln_list;
 err:
@@ -1967,8 +1992,13 @@ mnl_nft_flowtable_dump(struct netlink_ctx *ctx, int family,
 		memory_allocation_error();
 
 	ret = nft_mnl_talk(ctx, nlh, nlh->nlmsg_len, flowtable_cb, nln_list);
-	if (ret < 0 && errno != ENOENT)
-		goto err;
+	if (ret < 0 && errno != ENOENT) {
+		if (errno != EINTR)
+			goto err;
+
+		if (!ctx->ignore_eintr)
+			goto err;
+	}
 
 	return nln_list;
 err:
diff --git a/src/rule.c b/src/rule.c
index 633a5a12486d..3f984ea1fec5 100644
--- a/src/rule.c
+++ b/src/rule.c
@@ -2192,6 +2192,9 @@ static int do_list_ruleset(struct netlink_ctx *ctx, struct cmd *cmd)
 	unsigned int family = cmd->handle.family;
 	struct table *table;
 
+	if (ctx->nft->cache.eintr)
+		nft_print(&ctx->nft->output, "# Warning: ruleset has been updated while listing\n");
+
 	list_for_each_entry(table, &ctx->nft->cache.table_cache.list, cache.list) {
 		if (family != NFPROTO_UNSPEC &&
 		    table->handle.family != family)

[Index of Archives]     [Netfitler Users]     [Berkeley Packet Filter]     [LARTC]     [Bugtraq]     [Yosemite Forum]

  Powered by Linux