Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 7 Mar 2024 14:09:00 -0800

On Thu, 7 Mar 2024 at 13:40, Julia Lawall <julia.lawall@xxxxxxxx> wrote:
>
> I tried the following:
>
> @@
> expression x;
> @@
>
> *WRITE_ONCE(x,<+...READ_ONCE(x)...+>)
>
> This gave a number of results, shown below.  Let me know if some of them
> are undesirable.

Well, all the ones you list do look like garbage.

That said, quite often the garbage does seem to be "we don't actually
care about the result". Several of them look like statistics.

Some of them look outright nasty, though:

> --- /home/julia/linux/net/netfilter/nf_tables_api.c
> +++ /tmp/nothing/net/netfilter/nf_tables_api.c
> @@ -10026,8 +10026,6 @@ static unsigned int nft_gc_seq_begin(str
>         unsigned int gc_seq;
>
>         /* Bump gc counter, it becomes odd, this is the busy mark. */
> -       gc_seq = READ_ONCE(nft_net->gc_seq);
> -       WRITE_ONCE(nft_net->gc_seq, ++gc_seq);

The above is garbage code, and the comment implies that it is garbage
code that _should_ be reliable.

> diff -u -p /home/julia/linux/fs/xfs/xfs_icache.c /tmp/nothing/fs/xfs/xfs_icache.c
> --- /home/julia/linux/fs/xfs/xfs_icache.c
> +++ /tmp/nothing/fs/xfs/xfs_icache.c
> @@ -2076,8 +2076,6 @@ xfs_inodegc_queue(
>         cpu_nr = get_cpu();
>         gc = this_cpu_ptr(mp->m_inodegc);
>         llist_add(&ip->i_gclist, &gc->list);
> -       items = READ_ONCE(gc->items);
> -       WRITE_ONCE(gc->items, items + 1);

In contrast, this is also garbage code, but the only user of it seems
to be a heuristic, so if 'items' is off by one (or by a hundred), it
probably doesn't matter.

The xfs code is basically using that 'items' count to decide if it
really wants to do GC or not.

This is actually a case where having a "UNSAFE_INCREMENTISH()" macro
might make sense.

That said, this is also a case where using a "local_t" and using
"local_add_return()" might be a better option. It falls back on true
atomics, but at least on x86 you probably get *better* code generation
for the "incrementish" operation than you get with READ_ONCE ->
WRITE_ONCE.

> diff -u -p /home/julia/linux/kernel/rcu/tree.c /tmp/nothing/kernel/rcu/tree.c
> --- /home/julia/linux/kernel/rcu/tree.c
> +++ /tmp/nothing/kernel/rcu/tree.c
> @@ -1620,8 +1620,6 @@ static void rcu_gp_fqs(bool first_time)
>         /* Clear flag to prevent immediate re-entry. */
>         if (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) {
>                 raw_spin_lock_irq_rcu_node(rnp);
> -               WRITE_ONCE(rcu_state.gp_flags,
> -                          READ_ONCE(rcu_state.gp_flags) & ~RCU_GP_FLAG_FQS);
>                 raw_spin_unlock_irq_rcu_node(rnp);

This smells bad to me. The code is holding a lock, but apparently not
one that protects gp_flags.

And that READ_ONCE->WRITE_ONCE sequence can corrupt all the other flags.

Maybe it's fine for some reason (that reason being either that the
ONCE operations aren't actually needed at all, or because nobody
*really* cares about the flags), but it smells.

> @@ -1882,8 +1880,6 @@ static void rcu_report_qs_rsp(unsigned l
>  {
>         raw_lockdep_assert_held_rcu_node(rcu_get_root());
>         WARN_ON_ONCE(!rcu_gp_in_progress());
> -       WRITE_ONCE(rcu_state.gp_flags,
> -                  READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
>         raw_spin_unlock_irqrestore_rcu_node(rcu_get_root(), flags);

Same field, same lock held, same odd smelly pattern.

> -       WRITE_ONCE(rcu_state.gp_flags,
> -                  READ_ONCE(rcu_state.gp_flags) | RCU_GP_FLAG_FQS);
>         raw_spin_unlock_irqrestore_rcu_node(rnp_old, flags);

.. and again.

> --- /home/julia/linux/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.c
> @@ -80,8 +80,6 @@ static int cn23xx_vf_reset_io_queues(str
>                                 q_no);
>                         return -1;
>                 }
> -               WRITE_ONCE(reg_val, READ_ONCE(reg_val) &
> -                          ~CN23XX_PKT_INPUT_CTL_RST);
>                 octeon_write_csr64(oct, CN23XX_VF_SLI_IQ_PKT_CONTROL64(q_no),
>                                    READ_ONCE(reg_val));

I suspect this is garbage that has been triggered by the usual
mindless "fix the symptoms, not the bug" as a result of a "undefined
behavior report".

>> --- /home/julia/linux/kernel/kcsan/kcsan_test.c
> +++ /tmp/nothing/kernel/kcsan/kcsan_test.c
> @@ -381,7 +381,6 @@ static noinline void test_kernel_change_
>                 test_var ^= TEST_CHANGE_BITS;
>                 kcsan_nestable_atomic_end();
>         } else
> -               WRITE_ONCE(test_var, READ_ONCE(test_var) ^ TEST_CHANGE_BITS);

Presumably this is intentionally testing whether KCSAN notices these
things at all.

> diff -u -p /home/julia/linux/arch/s390/kernel/idle.c /tmp/nothing/arch/s390/kernel/idle.c
>         /* Account time spent with enabled wait psw loaded as idle time. */
> -       WRITE_ONCE(idle->idle_time, READ_ONCE(idle->idle_time) + idle_time);
> -       WRITE_ONCE(idle->idle_count, READ_ONCE(idle->idle_count) + 1);
>         account_idle_time(cputime_to_nsecs(idle_time));

This looks like another "UNSAFE_INCREMENTISH()" case.

> --- /home/julia/linux/mm/mmap.c
> +++ /tmp/nothing/mm/mmap.c
> @@ -3476,7 +3476,6 @@ bool may_expand_vm(struct mm_struct *mm,
>
>  void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
>  {
> -       WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm)+npages);

As does this.

> diff -u -p /home/julia/linux/fs/xfs/libxfs/xfs_iext_tree.c /tmp/nothing/fs/xfs/libxfs/xfs_iext_tree.c
>  static inline void xfs_iext_inc_seq(struct xfs_ifork *ifp)
>  {
> -       WRITE_ONCE(ifp->if_seq, READ_ONCE(ifp->if_seq) + 1);
>  }

Ugh. A sequence count that is "incrementish"? That smells wrong to me.
But I didn't go look at the users. Maybe it's another case of "we
don't *actually* care about the sequence count".

>
> +++ /tmp/nothing/drivers/net/ethernet/cavium/liquidio/cn23xx_pf_device.c
> @@ -379,8 +379,6 @@ static int cn23xx_reset_io_queues(struct
>                                 q_no);
>                         return -1;
>                 }
> -               WRITE_ONCE(reg_val, READ_ONCE(reg_val) &
> -                       ~CN23XX_PKT_INPUT_CTL_RST);
> ....
> -               WRITE_ONCE(d64, READ_ONCE(d64) &
> -                                       (~(CN23XX_PKT_INPUT_CTL_RING_ENB)));
> -               WRITE_ONCE(d64, READ_ONCE(d64) | CN23XX_PKT_INPUT_CTL_RST);

More "likely wrong" cases.

> +++ /tmp/nothing/mm/kfence/kfence_test.c
> @@ -501,7 +501,6 @@ static void test_kmalloc_aligned_oob_wri
>          * fault immediately after it.
>          */
>         expect.addr = buf + size;
> -       WRITE_ONCE(*expect.addr, READ_ONCE(*expect.addr) + 1);

Looks like questionable test-code again.

> +++ /tmp/nothing/io_uring/io_uring.c
> @@ -363,7 +363,6 @@ static void io_account_cq_overflow(struc
>  {
>         struct io_rings *r = ctx->rings;
>
> -       WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
>         ctx->cq_extra--;

Bah. Looks like garbage, but the kernel doesn't actually use that
value. Looks like a random number generator exposed to user space.
Presumably this is another "statistics, but I don't care enouhg".

> @@ -2403,8 +2402,6 @@ static bool io_get_sqe(struct io_ring_ct
> -                       WRITE_ONCE(ctx->rings->sq_dropped,
> -                                  READ_ONCE(ctx->rings->sq_dropped) + 1);

As is the above.

> +++ /tmp/nothing/security/apparmor/apparmorfs.c
> @@ -596,7 +596,6 @@ static __poll_t ns_revision_poll(struct
>
>  void __aa_bump_ns_revision(struct aa_ns *ns)
>  {
> -       WRITE_ONCE(ns->revision, READ_ONCE(ns->revision) + 1);
>         wake_up_interruptible(&ns->wait);

This looks like somebody copied the RCU / tracing pattern?

> +++ /tmp/nothing/arch/riscv/kvm/vmid.c
> @@ -90,7 +90,6 @@ void kvm_riscv_gstage_vmid_update(struct
>
>         /* First user of a new VMID version? */
>         if (unlikely(vmid_next == 0)) {
> -               WRITE_ONCE(vmid_version, READ_ONCE(vmid_version) + 1);
>                 vmid_next = 1;

Looks bogus and wrong. An unreliable address space version does _not_
sound sane, but who knows.

Anyway, from a quick look, there's a mix of "this is just wrong" and a
couple of "this seems to just want approximate statistics".

Maybe the RCU 'flags' field is using WRITE_ONCE() because while the
spinlock protects the bit changes, there are readers that look at
other bits with READ_ONCE.

That would imply that the READ_ONCE->WRITE_ONCE is just broken garbage
- the WRITE_ONCE() part may be right, but the READ_ONCE is wrong
because the value is stable.

                  Linus