Hi all, [+tglx and peterz for seqcount abuse]

On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote:
> subashab@xxxxxxxxxxxxxx <subashab@xxxxxxxxxxxxxx> wrote:
> > > > Unfortunately we are seeing it on ARM64 regression systems which
> > > > runs a
> > > > variety of
> > > > usecases so the exact steps are not known.
> > >
> > > Ok. Would you be willing to run some of those with your suggested
> > > change to see if that resolves the crashes or is that so rare that this
> > > isn't practical?
> >
> > I can try that out. Let me know if you have any other suggestions as well
> > and I can try that too.
> >
> > I assume we cant add locks here as it would be in the packet processing
> > path.
>
> Yes. We can add a synchronize_net() in xt_replace_table if needed
> though, before starting to put the references on the old ruleset
> This would avoid the free of the jumpstack while skbs are still
> in-flight.

I tried to make sense of what's going on here, and it looks to me like the
interaction between xt_replace_table() and ip6t_do_table() relies on
store->load ordering that is _not_ enforced by the code.

xt_replace_table() does:

	table->private = newinfo;

	/* make sure all cpus see new ->private value */
	smp_wmb();

	/*
	 * Even though table entries have now been swapped, other CPU's
	 * may still be using the old entries...
	 */
	local_bh_enable();

	/* ... so wait for even xt_recseq on all cpus */
	for_each_possible_cpu(cpu) {
		seqcount_t *s = &per_cpu(xt_recseq, cpu);
		u32 seq = raw_read_seqcount(s);

		if (seq & 1) {
			do {
				cond_resched();
				cpu_relax();
			} while (seq == raw_read_seqcount(s));
		}
	}

and I think the idea here is that we swizzle the table->private pointer to
point to the new data and then wait for all pre-existing readers to finish
with the old table (isn't this what RCU is for?).

On the reader side, ip6t_do_table() does:

	addend = xt_write_recseq_begin();
	private = READ_ONCE(table->private); /* Address dependency. */

where xt_write_recseq_begin() is pretty creative:

	unsigned int addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;

	__this_cpu_add(xt_recseq.sequence, addend);
	smp_wmb();

	return addend;

I think this can be boiled down to an instance of the "SB" litmus test:

	CPU 0                           CPU 1

	Wtable = newinfo                Rseq = 0
	smp_wmb()                       Wseq = 1 (i.e. seq += 1)
	Rseq = 0                        smp_wmb()
	<free oldinfo>                  Rtable = oldinfo

Since smp_wmb() only orders store->store, both CPUs can hoist their reads
above the preceding writes and we get the undesirable outcome: CPU 0 sees an
even sequence count and frees the old table while CPU 1 is still reading from
it. This should be observable on x86 too.

Given that this appears to be going wrong in practice, I've included a quick
hack below which should fix the ordering. However, I think it would be better
if we could avoid abusing the seqcount code for this sort of thing. (A
litmus-test sketch of the SB pattern above follows the diff.)

Cheers,

Will

--->8

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099d156d..8ec48466410a 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
 	 * since addend is most likely 1
 	 */
 	__this_cpu_add(xt_recseq.sequence, addend);
-	smp_wmb();
+	smp_mb();
 
 	return addend;
 }
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe85e2c..383a47ff0c40 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1384,10 +1384,12 @@ xt_replace_table(struct xt_table *table,
 	 * private.
 	 */
 	smp_wmb();
-	table->private = newinfo;
-
-	/* make sure all cpus see new ->private value */
-	smp_wmb();
+	WRITE_ONCE(table->private, newinfo);
+	/*
+	 * make sure all cpus see new ->private value before we load the
+	 * sequence counters
+	 */
+	smp_mb();
 
 	/*
 	 * Even though table entries have now been swapped, other CPU's
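
P.S. In case it is useful, here is a rough litmus test for the SB pattern
discussed above, written against the kernel memory model in
tools/memory-model (runnable with herd7). It is only a sketch: "table" and
"seq" are stand-ins for table->private and the per-cpu xt_recseq.sequence,
and the test name is made up. With smp_wmb() on both sides the "exists"
clause (both CPUs reading the old values) should be reported as allowed;
upgrading both barriers to smp_mb(), as in the hack above, should make it
forbidden.

	C SB+xtrecseq

	{}

	P0(int *table, int *seq)
	{
		int r0;

		WRITE_ONCE(*table, 1);	/* table->private = newinfo */
		smp_wmb();		/* smp_mb() with the hack above */
		r0 = READ_ONCE(*seq);	/* raw_read_seqcount() */
	}

	P1(int *table, int *seq)
	{
		int r1;

		WRITE_ONCE(*seq, 1);	/* xt_write_recseq_begin() */
		smp_wmb();		/* smp_mb() with the hack above */
		r1 = READ_ONCE(*table);	/* READ_ONCE(table->private) */
	}

	exists (0:r0=0 /\ 1:r1=0)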