On Wed, Nov 18, 2020 at 01:42:28PM +0100, Florian Westphal wrote: > Will Deacon <will@xxxxxxxxxx> wrote: > > On Mon, Nov 16, 2020 at 07:20:28PM +0100, Florian Westphal wrote: > > > subashab@xxxxxxxxxxxxxx <subashab@xxxxxxxxxxxxxx> wrote: > > > > > > Unfortunately we are seeing it on ARM64 regression systems which > > > > > > runs a > > > > > > variety of > > > > > > usecases so the exact steps are not known. > > > > > > > > > > Ok. Would you be willing to run some of those with your suggested > > > > > change to see if that resolves the crashes or is that so rare that this > > > > > isn't practical? > > > > > > > > I can try that out. Let me know if you have any other suggestions as well > > > > and I can try that too. > > > > > > > > I assume we cant add locks here as it would be in the packet processing > > > > path. > > > > > > Yes. We can add a synchronize_net() in xt_replace_table if needed > > > though, before starting to put the references on the old ruleset > > > This would avoid the free of the jumpstack while skbs are still > > > in-flight. > > > > I tried to make sense of what's going on here, and it looks to me like > > the interaction between xt_replace_table() and ip6t_do_table() relies on > > store->load order that is _not_ enforced in the code. > > > > xt_replace_table() does: > > > > table->private = newinfo; > > > > /* make sure all cpus see new ->private value */ > > smp_wmb(); > > > > /* > > * Even though table entries have now been swapped, other CPU's > > * may still be using the old entries... > > */ > > local_bh_enable(); > > > > /* ... so wait for even xt_recseq on all cpus */ > > for_each_possible_cpu(cpu) { > > seqcount_t *s = &per_cpu(xt_recseq, cpu); > > u32 seq = raw_read_seqcount(s); > > > > if (seq & 1) { > > do { > > cond_resched(); > > cpu_relax(); > > } while (seq == raw_read_seqcount(s)); > > } > > } > > > > and I think the idea here is that we swizzle the table->private pointer > > to point to the new data, and then wait for all pre-existing readers to > > finish with the old table (isn't this what RCU is for?). > > Yes and yes. Before the current 'for_each_possible_cpu() + seqcount > abuse the abuse was just implicit; xt_replace_table did NOT wait for > readers to go away and relied on arp, eb and ip(6)tables to store the > counters back to userspace after xt_replace_table(). Thanks for the background. > This meant a read_seqcount_begin/retry for each counter & eventually > gave complaits from k8s users that x_tables rule replacement was too > slow. > > [..] > > > Given that this appears to be going wrong in practice, I've included a quick > > hack below which should fix the ordering. However, I think it would be > > better if we could avoid abusing the seqcount code for this sort of thing. > > I'd be ok with going via the simpler solution & wait if k8s users > complain that its too slow. Those memory blobs can be huge so I would > not use call_rcu here. If you can stomach the cost of synchronize_rcu() then this at least gets rid of that for_each_possible_cpu() loop! Also... > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > index af22dbe85e2c..b5911985d1eb 100644 > --- a/net/netfilter/x_tables.c > +++ b/net/netfilter/x_tables.c > @@ -1384,7 +1384,7 @@ xt_replace_table(struct xt_table *table, > * private. > */ > smp_wmb(); > - table->private = newinfo; > + WRITE_ONCE(table->private, newinfo); ... you could make this rcu_assign_pointer() and get rid of the preceding smp_wmb()... > > /* make sure all cpus see new ->private value */ > smp_wmb(); ... and this smp_wmb() is no longer needed because synchronize_rcu() takes care of the ordering. > @@ -1394,19 +1394,7 @@ xt_replace_table(struct xt_table *table, > * may still be using the old entries... > */ > local_bh_enable(); > - > - /* ... so wait for even xt_recseq on all cpus */ > - for_each_possible_cpu(cpu) { > - seqcount_t *s = &per_cpu(xt_recseq, cpu); > - u32 seq = raw_read_seqcount(s); > - > - if (seq & 1) { > - do { > - cond_resched(); > - cpu_relax(); > - } while (seq == raw_read_seqcount(s)); > - } > - } Do we still need xt_write_recseq_{begin,end}() with this removed? Cheers, Will