Re: [kvm-unit-tests PATCH 10/10] arm64: gic: Use IPI test checking for the LPI tests

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Zenghui,

On 11/30/20 1:59 PM, Zenghui Yu wrote:
> Hi Alex,
>
> On 2020/11/27 22:50, Alexandru Elisei wrote:
>> Hi Zhenghui,
>>
>> Thank you for having a look at this!
>>
>> On 11/26/20 9:30 AM, Zenghui Yu wrote:
>>> On 2020/11/25 23:51, Alexandru Elisei wrote:
>>>> The reason for the failure is that the test "dev2/eventid=20 now triggers
>>>> an LPI" triggers 2 LPIs, not one. This behavior was present before this
>>>> patch, but it was ignored because check_lpi_stats() wasn't looking at the
>>>> acked array.
>>>>
>>>> I'm not familiar with the ITS so I'm not sure if this is expected, if the
>>>> test is incorrect or if there is something wrong with KVM emulation.
>>>
>>> I think this is expected, or not.
>>>
>>> Before INVALL, the LPI-8195 was already pending but disabled. On
>>> receiving INVALL, VGIC will reload configuration for all LPIs targeting
>>> collection-3 and deliver the now enabled LPI-8195. We'll therefore see
>>> and handle it before sending the following INT (which will set the
>>> LPI-8195 pending again).
>>>
>>>> Did some more testing on an Ampere eMAG (fast out-of-order cores) using
>>>> qemu and kvmtool and Linux v5.8, here's what I found:
>>>>
>>>> - Using qemu and gic.flat built from*master*: error encountered 864 times
>>>>     out of 1088 runs.
>>>> - Using qemu: error encountered 852 times out of 1027 runs.
>>>> - Using kvmtool: error encountered 8164 times out of 10602 runs.
>>>
>>> If vcpu-3 hadn't seen and handled LPI-8195 as quickly as possible (e.g.,
>>> vcpu-3 hadn't been scheduled), the following INT will set the already
>>> pending LPI-8195 pending again and we'll receive it *once* on vcpu-3.
>>> And we won't see the mentioned failure.
>>>
>>> I think we can just drop the (meaningless and confusing?) INT.
>>
>> I think I understand your explanation, the VCPU takes the interrupt immediately
>> after the INVALL and before the INT, and the second interrupt that I am seeing is
>> the one caused by the INT command.
>
> Yes.
>
>> I tried modifying the test like this:
>>
>> diff --git a/arm/gic.c b/arm/gic.c
>> index 6e93da80fe0d..0ef8c12ea234 100644
>> --- a/arm/gic.c
>> +++ b/arm/gic.c
>> @@ -761,10 +761,17 @@ static void test_its_trigger(void)
>>          wmb();
>>          cpumask_clear(&mask);
>>          cpumask_set_cpu(3, &mask);
>> -       its_send_int(dev2, 20);
>
> Shouldn't its_send_invall(col3) be moved down here? See below.
>
>>          wait_for_interrupts(&mask);
>>          report(check_acked(&mask, 0, 8195),
>> -                       "dev2/eventid=20 now triggers an LPI");
>> +                       "dev2/eventid=20 pending LPI is received");
>> +
>> +       stats_reset();
>> +       wmb();
>> +       cpumask_clear(&mask);
>> +       cpumask_set_cpu(3, &mask);
>> +       its_send_int(dev2, 20);
>> +       wait_for_interrupts(&mask);
>> +       report(check_acked(&mask, 0, 8195), "dev2/eventid=20 triggers an LPI");
>>            report_prefix_pop();
>>   I removed the INT from the initial test, and added a separate one to check that
>> the INT command still works. That looks to me that preserves the spirit of the
>> original test. After doing stress testing this is what I got:
>>
>> - with kvmtool, 47,709 iterations, 27 times the test timed out when waiting for
>> the interrupt after INVALL.
>> - with qemu, 15,511 iterations, 258 times the test timed out when waiting for the
>> interrupt after INVALL, just like with kvmtool.
>
> I guess the reason of failure is that the LPI is taken *immediately*
> after the INVALL?
>
>     /* Now call the invall and check the LPI hits */
>     its_send_invall(col3);
>         <- LPI is taken, acked[]++
>     stats_reset();
>         <- acked[] is cleared unexpectedly
>     wmb();
>     cpumask_clear(&mask);
>     cpumask_set_cpu(3, &mask);
>     wait_for_interrupts(&mask);
>         <- we'll hit timed-out since acked[] is 0

Yes, of course, you're right, I didn't realize that I was resetting the stats
*after* the interrupt was enabled. This also explains why I was still seeing
timeouts even when the timeout duration was set to 50 seconds. I'll retest with
the fix:

diff --git a/arm/gic.c b/arm/gic.c
index 6e93da80fe0d..c4240f5aba39 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -756,15 +756,22 @@ static void test_its_trigger(void)
                        "dev2/eventid=20 still does not trigger any LPI");
 
        /* Now call the invall and check the LPI hits */
+       stats_reset();
+       wmb();
+       cpumask_clear(&mask);
+       cpumask_set_cpu(3, &mask);
        its_send_invall(col3);
+       wait_for_interrupts(&mask);
+       report(check_acked(&mask, 0, 8195),
+                       "dev2/eventid=20 pending LPI is received");
+
        stats_reset();
        wmb();
        cpumask_clear(&mask);
        cpumask_set_cpu(3, &mask);
        its_send_int(dev2, 20);
        wait_for_interrupts(&mask);
-       report(check_acked(&mask, 0, 8195),
-                       "dev2/eventid=20 now triggers an LPI");
+       report(check_acked(&mask, 0, 8195), "dev2/eventid20 triggers an LPI");
 
        report_prefix_pop();
 
I also pushed a branch at [1].

Thank you so much for spotting this! You've saved me (and probably others) a lot
of time debugging.

[1] https://gitlab.arm.com/linux-arm/kvm-unit-tests-ae/-/tree/fixes1-v2

Thanks,
Alex



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux