On Tue, 2022-08-16 at 20:02 +1000, Gavin Shan wrote: > Hi Sean, > > On 8/16/22 3:02 PM, Gavin Shan wrote: > > On 8/16/22 7:42 AM, Sean Christopherson wrote: > > > On Mon, Aug 15, 2022, kernel test robot wrote: > > > > commit: e923b0537d28e15c9d31ce8b38f810b325816903 ("KVM: > > > > selftests: Fix target thread to be migrated in rseq_test") > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git > > > > master > > > > > > ... > > > > > > > # selftests: kvm: rseq_test > > > > # ==== Test Assertion Failure ==== > > > > # rseq_test.c:278: i > (NR_TASK_MIGRATIONS / 2) > > > > # pid=49599 tid=49599 errno=4 - Interrupted system call > > > > # 1 0x000000000040265d: main at rseq_test.c:278 > > > > # 2 0x00007fe44eed07fc: ?? ??:0 > > > > # 3 0x00000000004026d9: _start at ??:? > > > > # Only performed 23174 KVM_RUNs, task stalled too much? > > > > # > > > > not ok 56 selftests: kvm: rseq_test # exit=254 > > > > > > ... > > > > > > > # Automatically generated file; DO NOT EDIT. > > > > # Linux/x86_64 5.19.0-rc6 Kernel Configuration > > > > # > > > > CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-3) 11.3.0" > > > > CONFIG_CC_IS_GCC=y > > > > CONFIG_GCC_VERSION=110300 > > > > CONFIG_CLANG_VERSION=0 > > > > CONFIG_AS_IS_GNU=y > > > > CONFIG_AS_VERSION=23800 > > > > CONFIG_LD_IS_BFD=y > > > > CONFIG_LD_VERSION=23800 > > > > CONFIG_LLD_VERSION=0 > > > > > > Assuming 23800 == 2.38, this is a known issue. > > > > > > https://lore.kernel.org/all/20220810104114.6838-1-gshan@xxxxxxxxxx > > > > > > > It's probably different story this time. The assert is triggered > > because > > of the following instructions. I would guess the reason is vcpu > > thread > > has been running on CPU where we has high CPU load. In this case, > > the > > vcpu thread can't be run in time. More specific, the vcpu thread > > can't > > be run in the 1 - 10us time window, which is specified by the > > migration > > worker (thread). > > > > TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2), > > "Only performed %d KVM_RUNs, task stalled too > > much?\n", i); > > > > I think we need to improve the handshake mechanism between the vcpu > > thread > > and migration worker. In current implementation, the handshake is > > done through > > the atomic counter. The mechanism is simple enough, but vcpu thread > > can miss > > the aforementioned time window. Another issue is the test case much > > more time > > than expected to finish. > > > > Sean, if you think it's reasonable, I can figure out something to > > do: > > > > - Reuse the atomic counter for a full synchronization between these > > two > > threads. Something like below: > > > > #define RSEQ_TEST_STATE_RUN_VCPU 0 // vcpu_run() > > #define RSEQ_TEST_STATE_MIGRATE 1 // > > sched_setaffinity() > > #define RSEQ_TEST_STATE_CHECK 2 // Check > > rseq.cpu_id and get_cpu() > > > > The atomic counter is reset to RSEQ_TEST_STATE_RUN_VCPU after > > RSEQ_TEST_STATE_RUN_VCPU > > > > - Reduce NR_TASK_MIGRATIONS from 100000 to num_of_online_cpus(). > > With this, > > less time is needed to finish the test case. > > > > I'm able to recreate the issue on my local arm64 system. > > - From the source code, the iteration count is changed from 100000 to > 1000 > - Only CPU#0 and CPU#1 are exposed in calc_min_max_cpu, meaning other > CPUs > are cleared from @possible_mask > - Run some CPU bound task on CPU#0 and CPU#1 > # while true; do taskset -c 0 ./a; done > # while true; do taskset -c 1 ./a; done > - Run 'rseq_test' and hit the issue > # ./rseq_test > calc_min_max_cpu: nproc=224 > calc_min_max_cpu: min_cpu=0, max_cpu=1 > main: Required tests: 1000 Succeeds: 1 > ==== Test Assertion Failure ==== > rseq_test.c:280: i > (NR_TASK_MIGRATIONS / 2) > pid=9624 tid=9624 errno=4 - Interrupted system call > 1 0x0000000000401cf3: main at rseq_test.c:280 > 2 0x0000ffffbc64679b: ?? ??:0 > 3 0x0000ffffbc64686b: ?? ??:0 > 4 0x0000000000401def: _start at ??:? > Only performed 1 KVM_RUNs, task stalled too much? > I met this as well, and can reproduce this easily on my Cascade-Lake machine, without heavy workload; before I find this mail. I'm wondering if it's because vcpu_run() has become much more time consuming than this test case was defined? BTW, if we substitute sys_getcpu(&cpu) with vdso_getcpu(), can reduce the reproduction odds. > Thanks, > Gavin > >