On Thu, Jul 14, 2022, Paolo Bonzini wrote:
> On 7/14/22 10:06, Gavin Shan wrote:
> > In rseq_test, there are two threads created. Those two threads are
> > 'main' and 'migration_thread' separately. We also have the assumption
> > that non-migration status on 'migration-worker' thread guarantees the
> > same non-migration status on 'main' thread. Unfortunately, the assumption
> > isn't true. The 'main' thread can be migrated from one CPU to another
> > one between the calls to sched_getcpu() and READ_ONCE(__rseq.cpu_id).
> > The following assert is raised eventually because of the mismatched
> > CPU numbers.
> >
> > The issue can be reproduced on arm64 system occasionally.
>
> Hmm, this does not seem a correct patch - the threads are already
> synchronizing using seq_cnt, like this:
>
>     migration                       main
>     ----------------------          --------------------------------
>     seq_cnt = 1
>     smp_wmb()
>                                     snapshot = 0
>                                     smp_rmb()
>                                     cpu = sched_getcpu() reads 23
>     sched_setaffinity()
>                                     rseq_cpu = __rseq.cpuid reads 35
>                                     smp_rmb()
>                                     snapshot != seq_cnt -> retry
>     smp_wmb()
>     seq_cnt = 2
>
> sched_setaffinity() is guaranteed to block until the task is enqueued on an
> allowed CPU.

Yes, and retrying could suppress detection of kernel bugs that this test is
intended to catch.

> Can you check that smp_rmb() and smp_wmb() generate correct instructions on
> arm64?

That seems like the most likely scenario (or a kernel bug), I distinctly
remember the barriers provided by tools/ being rather bizarre.
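
For reference, a minimal sketch of the seq_cnt handshake from the quoted diagram, written with C11 atomics so that it does not depend on the tools/ barrier macros. This is not the selftest's code: fake_cpu, migrate_begin(), migrate_end(), migrate_to(), read_begin() and read_retry() are hypothetical names, and the release/acquire fences are assumed stand-ins for smp_wmb()/smp_rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state; names are illustrative only. */
static atomic_uint seq_cnt;  /* odd while a migration is in flight */
static atomic_int fake_cpu;  /* stands in for "the CPU the task runs on" */

/* "migration" column: bump the counter to odd before changing anything. */
static void migrate_begin(void)
{
	atomic_fetch_add_explicit(&seq_cnt, 1u, memory_order_relaxed);
	atomic_thread_fence(memory_order_release); /* assumed smp_wmb() analogue */
}

/* "migration" column: bump the counter back to even once the change is done. */
static void migrate_end(void)
{
	unsigned int prev;

	atomic_thread_fence(memory_order_release); /* assumed smp_wmb() analogue */
	prev = atomic_fetch_add_explicit(&seq_cnt, 1u, memory_order_relaxed);
	assert(prev & 1u); /* must pair with a preceding migrate_begin() */
}

/* Simulated sched_setaffinity(): the change is bracketed by the counter. */
static void migrate_to(int cpu)
{
	migrate_begin();
	atomic_store_explicit(&fake_cpu, cpu, memory_order_relaxed);
	migrate_end();
}

/*
 * "main" column: take a snapshot with the low bit masked off, so that an
 * odd (in-flight) counter value can never compare equal later on.
 */
static unsigned int read_begin(void)
{
	unsigned int snapshot;

	snapshot = atomic_load_explicit(&seq_cnt, memory_order_relaxed) & ~1u;
	atomic_thread_fence(memory_order_acquire); /* assumed smp_rmb() analogue */
	return snapshot;
}

/* "main" column: true if a migration overlapped the read section. */
static bool read_retry(unsigned int snapshot)
{
	atomic_thread_fence(memory_order_acquire); /* assumed smp_rmb() analogue */
	return snapshot != atomic_load_explicit(&seq_cnt, memory_order_relaxed);
}
```

The reader would wrap its two CPU reads between read_begin() and read_retry() and loop while read_retry() returns true. In this sketch a mismatch seen when read_retry() returns false is treated as a real failure, which is why an extra retry on mismatch would hide exactly the bugs the test is after.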