Hi, On 11/30/20 2:19 PM, Alexandru Elisei wrote: > Hi Zenghui, > > On 11/30/20 1:59 PM, Zenghui Yu wrote: >> Hi Alex, >> >> On 2020/11/27 22:50, Alexandru Elisei wrote: >>> Hi Zhenghui, >>> >>> Thank you for having a look at this! >>> >>> On 11/26/20 9:30 AM, Zenghui Yu wrote: >>>> On 2020/11/25 23:51, Alexandru Elisei wrote: >>>>> The reason for the failure is that the test "dev2/eventid=20 now triggers >>>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this >>>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the >>>>> acked array. >>>>> >>>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the >>>>> test is incorrect or if there is something wrong with KVM emulation. >>>> I think this is expected, or not. >>>> >>>> Before INVALL, the LPI-8195 was already pending but disabled. On >>>> receiving INVALL, VGIC will reload configuration for all LPIs targeting >>>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see >>>> and handle it before sending the following INT (which will set the >>>> LPI-8195 pending again). >>>> >>>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using >>>>> qemu and kvmtool and Linux v5.8, here's what I found: >>>>> >>>>> - Using qemu and gic.flat built from*master*: error encountered 864 times >>>>> out of 1088 runs. >>>>> - Using qemu: error encountered 852 times out of 1027 runs. >>>>> - Using kvmtool: error encountered 8164 times out of 10602 runs. >>>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g., >>>> vcpu-3 hadn't been scheduled), the following INT will set the already >>>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3. >>>> And we won't see the mentioned failure. >>>> >>>> I think we can just drop the (meaningless and confusing?) INT. >>> I think I understand your explanation, the VCPU takes the interrupt immediately >>> after the INVALL and before the INT, and the second interrupt that I am seeing is >>> the one caused by the INT command. >> Yes. >> >>> I tried modifying the test like this: >>> >>> diff --git a/arm/gic.c b/arm/gic.c >>> index 6e93da80fe0d..0ef8c12ea234 100644 >>> --- a/arm/gic.c >>> +++ b/arm/gic.c >>> @@ -761,10 +761,17 @@ static void test_its_trigger(void) >>> wmb(); >>> cpumask_clear(&mask); >>> cpumask_set_cpu(3, &mask); >>> - its_send_int(dev2, 20); >> Shouldn't its_send_invall(col3) be moved down here? See below. >> >>> wait_for_interrupts(&mask); >>> report(check_acked(&mask, 0, 8195), >>> - "dev2/eventid=20 now triggers an LPI"); >>> + "dev2/eventid=20 pending LPI is received"); >>> + >>> + stats_reset(); >>> + wmb(); >>> + cpumask_clear(&mask); >>> + cpumask_set_cpu(3, &mask); >>> + its_send_int(dev2, 20); >>> + wait_for_interrupts(&mask); >>> + report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI"); >>> report_prefix_pop(); >>> I removed the INT from the initial test, and added a separate one to check that >>> the INT command still works. That looks to me that preserves the spirit of the >>> original test. After doing stress testing this is what I got: >>> >>> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for >>> the interrupt after INVALL. >>> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the >>> interrupt after INVALL, just like with kvmtool. >> I guess the reason of failure is that the LPI is taken *immediately* >> after the INVALL? >> >> /* Now call the invall and check the LPI hits */ >> its_send_invall(col3); >> <- LPI is taken, acked[]++ >> stats_reset(); >> <- acked[] is cleared unexpectedly >> wmb(); >> cpumask_clear(&mask); >> cpumask_set_cpu(3, &mask); >> wait_for_interrupts(&mask); >> <- we'll hit timed-out since acked[] is 0 > Yes, of course, you're right, I didn't realize that I was resetting the stats > *after* the interrupt was enabled. This also explains why I was still seeing > timeouts even when the timeout duration was set to 50 seconds. I'll retest with > the fix: > > diff --git a/arm/gic.c b/arm/gic.c > index 6e93da80fe0d..c4240f5aba39 100644 > --- a/arm/gic.c > +++ b/arm/gic.c > @@ -756,15 +756,22 @@ static void test_its_trigger(void) > "dev2/eventid=20 still does not trigger any LPI"); > > /* Now call the invall and check the LPI hits */ > + stats_reset(); > + wmb(); > + cpumask_clear(&mask); > + cpumask_set_cpu(3, &mask); > its_send_invall(col3); > + wait_for_interrupts(&mask); > + report(check_acked(&mask, 0, 8195), > + "dev2/eventid=20 pending LPI is received"); > + > stats_reset(); > wmb(); > cpumask_clear(&mask); > cpumask_set_cpu(3, &mask); > its_send_int(dev2, 20); > wait_for_interrupts(&mask); > - report(check_acked(&mask, 0, 8195), > - "dev2/eventid=20 now triggers an LPI"); > + report(check_acked(&mask, 0, 8195), "dev2/eventid20 triggers an LPI"); > > report_prefix_pop(); > > I also pushed a branch at [1]. > > Thank you so much for spotting this! You've saved me (and probably others) a lot > of time debugging. > > [1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2 I have been testing the branch, no failures after 17,996 runs with qemu and 58,669 runs with kvmtool. This looks fine to me, I'll send a v2 with the fix. Thanks, Alex