Marc Zyngier <maz@xxxxxxxxxx> writes:
On Fri, 10 Mar 2023 19:26:47 +0000, Colton Lewis <coltonlewis@xxxxxxxxxx> wrote:
Marc Zyngier <maz@xxxxxxxxxx> writes:
>> mvbbq9:/data/coltonlewis/ecv/arm64-obj/kselftest/kvm#
>> ./aarch64/arch_timer -O 0xffff
>> ==== Test Assertion Failure ====
>>   aarch64/arch_timer.c:239: false
>>   pid=48094 tid=48095 errno=4 - Interrupted system call
>>      1  0x4010fb: test_vcpu_run at arch_timer.c:239
>>      2  0x42a5bf: start_thread at pthread_create.o:0
>>      3  0x46845b: thread_start at clone.o:0
>>   Failed guest assert: xcnt >= cval at aarch64/arch_timer.c:151
>>   values: 2500645901305, 2500645961845; 9939, vcpu 0; stage; 3; iter: 2
> The fun part is that you can see similar things without the series:
> ==== Test Assertion Failure ====
>   aarch64/arch_timer.c:239: false
>   pid=647 tid=651 errno=4 - Interrupted system call
>      1  0x00000000004026db: test_vcpu_run at arch_timer.c:239
>      2  0x00007fffb13cedd7: ?? ??:0
>      3  0x00007fffb1437e9b: ?? ??:0
>   Failed guest assert: config_iter + 1 == irq_iter at aarch64/arch_timer.c:188
>   values: 2, 3; 0, vcpu 3; stage; 4; iter: 3
> That's on a vanilla kernel (6.2-rc4) on an M1 with the test run
> without any argument in a loop. After a few iterations, it blows.
I finally got to the bottom of that one. This is yet another case of the test making the assumption that spurious interrupts don't exist...
That's great!
Here, the timer interrupt has been masked at the source, but the GIC (or its emulation) can be slow to retire it. So we take it again, spuriously, and account it as a true interrupt. None of the asserts in the timer handler fire because they only check the *previous* state.
Eventually, the interrupt retires and we progress to the next iteration. But in the meantime, we have incremented the irq counter by the number of spurious events, and the test fails.
The obvious fix is to check for the timer state in the handler and exit early if the timer interrupt is masked or the timer disabled. With that, I don't see these failures anymore.
I've folded that into the patch that already deals with some spurious events.
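For illustration only, the early-exit check described above could look roughly like the sketch below. This is not the actual patch: `struct fake_timer`, `timer_irq_is_genuine()` and `fake_timer_irq_handler()` are hypothetical names, and the timer control register is simulated with a plain integer. Only the ENABLE/IMASK/ISTATUS bit positions come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

/* CNT{P,V}_CTL_EL0 bit layout as defined by the Arm architecture. */
#define CTL_ENABLE	(1u << 0)
#define CTL_IMASK	(1u << 1)
#define CTL_ISTATUS	(1u << 2)

/* Hypothetical stand-in for the guest's view of one timer. */
struct fake_timer {
	uint32_t ctl;		/* simulated timer control register */
	unsigned int irq_iter;	/* interrupts accounted as genuine */
};

/* True only if the timer could legitimately have raised the interrupt. */
static bool timer_irq_is_genuine(uint32_t ctl)
{
	return (ctl & CTL_ENABLE) && !(ctl & CTL_IMASK);
}

/*
 * Sketch of a handler with the early exit. Returns true if the
 * interrupt was accounted, false if it was dropped as spurious.
 */
static bool fake_timer_irq_handler(struct fake_timer *t)
{
	/* Timer disabled or masked: a stale interrupt, do not count it. */
	if (!timer_irq_is_genuine(t->ctl))
		return false;

	/* Assumed behaviour: mask at the source, then account the event. */
	t->ctl |= CTL_IMASK;
	t->irq_iter++;
	return true;
}
```

With a check of this kind, a second delivery of the same interrupt while the GIC is still retiring it leaves the counter untouched, so a comparison such as config_iter + 1 == irq_iter is no longer thrown off by spurious events.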
I'll be looking at it and will keep in mind your questions about my hardware should I find any issues. Yes, it has ECV and CNTPOFF, but no, I didn't try turning it off for this, because my issue occurred only when setting a physical offset, and that can't be done without ECV.