Hi Zenghui, On 11/30/20 1:59 PM, Zenghui Yu wrote: > Hi Alex, > > On 2020/11/27 22:50, Alexandru Elisei wrote: >> Hi Zhenghui, >> >> Thank you for having a look at this! >> >> On 11/26/20 9:30 AM, Zenghui Yu wrote: >>> On 2020/11/25 23:51, Alexandru Elisei wrote: >>>> The reason for the failure is that the test "dev2/eventid=20 now triggers >>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this >>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the >>>> acked array. >>>> >>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the >>>> test is incorrect or if there is something wrong with KVM emulation. >>> >>> I think this is expected, or not. >>> >>> Before INVALL, the LPI-8195 was already pending but disabled. On >>> receiving INVALL, VGIC will reload configuration for all LPIs targeting >>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see >>> and handle it before sending the following INT (which will set the >>> LPI-8195 pending again). >>> >>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using >>>> qemu and kvmtool and Linux v5.8, here's what I found: >>>> >>>> - Using qemu and gic.flat built from*master*: error encountered 864 times >>>> out of 1088 runs. >>>> - Using qemu: error encountered 852 times out of 1027 runs. >>>> - Using kvmtool: error encountered 8164 times out of 10602 runs. >>> >>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g., >>> vcpu-3 hadn't been scheduled), the following INT will set the already >>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3. >>> And we won't see the mentioned failure. >>> >>> I think we can just drop the (meaningless and confusing?) INT. >> >> I think I understand your explanation, the VCPU takes the interrupt immediately >> after the INVALL and before the INT, and the second interrupt that I am seeing is >> the one caused by the INT command. > > Yes. > >> I tried modifying the test like this: >> >> diff --git a/arm/gic.c b/arm/gic.c >> index 6e93da80fe0d..0ef8c12ea234 100644 >> --- a/arm/gic.c >> +++ b/arm/gic.c >> @@ -761,10 +761,17 @@ static void test_its_trigger(void) >> wmb(); >> cpumask_clear(&mask); >> cpumask_set_cpu(3, &mask); >> - its_send_int(dev2, 20); > > Shouldn't its_send_invall(col3) be moved down here? See below. > >> wait_for_interrupts(&mask); >> report(check_acked(&mask, 0, 8195), >> - "dev2/eventid=20 now triggers an LPI"); >> + "dev2/eventid=20 pending LPI is received"); >> + >> + stats_reset(); >> + wmb(); >> + cpumask_clear(&mask); >> + cpumask_set_cpu(3, &mask); >> + its_send_int(dev2, 20); >> + wait_for_interrupts(&mask); >> + report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI"); >> report_prefix_pop(); >> I removed the INT from the initial test, and added a separate one to check that >> the INT command still works. That looks to me that preserves the spirit of the >> original test. After doing stress testing this is what I got: >> >> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for >> the interrupt after INVALL. >> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the >> interrupt after INVALL, just like with kvmtool. > > I guess the reason of failure is that the LPI is taken *immediately* > after the INVALL? > > /* Now call the invall and check the LPI hits */ > its_send_invall(col3); > <- LPI is taken, acked[]++ > stats_reset(); > <- acked[] is cleared unexpectedly > wmb(); > cpumask_clear(&mask); > cpumask_set_cpu(3, &mask); > wait_for_interrupts(&mask); > <- we'll hit timed-out since acked[] is 0 Yes, of course, you're right, I didn't realize that I was resetting the stats *after* the interrupt was enabled. This also explains why I was still seeing timeouts even when the timeout duration was set to 50 seconds. I'll retest with the fix: diff --git a/arm/gic.c b/arm/gic.c index 6e93da80fe0d..c4240f5aba39 100644 --- a/arm/gic.c +++ b/arm/gic.c @@ -756,15 +756,22 @@ static void test_its_trigger(void) "dev2/eventid=20 still does not trigger any LPI"); /* Now call the invall and check the LPI hits */ + stats_reset(); + wmb(); + cpumask_clear(&mask); + cpumask_set_cpu(3, &mask); its_send_invall(col3); + wait_for_interrupts(&mask); + report(check_acked(&mask, 0, 8195), + "dev2/eventid=20 pending LPI is received"); + stats_reset(); wmb(); cpumask_clear(&mask); cpumask_set_cpu(3, &mask); its_send_int(dev2, 20); wait_for_interrupts(&mask); - report(check_acked(&mask, 0, 8195), - "dev2/eventid=20 now triggers an LPI"); + report(check_acked(&mask, 0, 8195), "dev2/eventid20 triggers an LPI"); report_prefix_pop(); I also pushed a branch at [1]. Thank you so much for spotting this! You've saved me (and probably others) a lot of time debugging. [1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2 Thanks, Alex