On 2023-10-31 11:57 p.m., Mi, Dapeng wrote:
>
> On 11/1/2023 11:24 AM, Jim Mattson wrote:
>> On Tue, Oct 31, 2023 at 8:16 PM Mi, Dapeng
>> <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>>>
>>> On 11/1/2023 10:47 AM, Jim Mattson wrote:
>>>> On Tue, Oct 31, 2023 at 7:33 PM Mi, Dapeng
>>>> <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>>>>> On 11/1/2023 2:47 AM, Jim Mattson wrote:
>>>>>> On Tue, Oct 31, 2023 at 2:22 AM Dapeng Mi
>>>>>> <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>>>>>>> Intel CPUs, like Sapphire Rapids, introduce a new fixed counter
>>>>>>> (fixed counter 3) to count/sample the topdown.slots event, but the
>>>>>>> current code still doesn't cover this new fixed counter.
>>>>>>>
>>>>>>> So this patch adds code to validate that this new fixed counter
>>>>>>> can count the slots event correctly.
>>>>>> I'm not convinced that this actually validates anything.
>>>>>>
>>>>>> Suppose, for example, that KVM used fixed counter 1 when the guest
>>>>>> asked for fixed counter 3. Wouldn't this test still pass?
>>>>> Per my understanding, as long as KVM returns a valid count in the
>>>>> reasonable count range, we can consider that KVM works correctly.
>>>>> We don't need to get entangled in how KVM really uses the HW; that
>>>>> could be impossible and unnecessary.
>>>> Now I see how the Pentium FDIV bug escaped notice. Hey, the numbers
>>>> are in a reasonable range. What's everyone upset about?
>>>>
>>>>> Yeah, the currently predefined valid count range may be somewhat
>>>>> loose, since I want to cover as much hardware as possible and avoid
>>>>> causing regressions. Especially after introducing the random jump
>>>>> and clflush instructions, the cycles and slots become much harder
>>>>> to predict. Maybe we can use a comparatively restricted count range
>>>>> in the initial change, and then loosen the restriction if we
>>>>> encounter a failure on some specific hardware. Do you think that
>>>>> would be better? Thanks.
>>>> I think the test is essentially useless and should probably just be
>>>> deleted, so that it doesn't give a false sense of confidence.
>>> IMO, I can't say the tests are totally useless. Yes, passing the
>>> tests doesn't mean the KVM vPMU must work correctly, but we can say
>>> there is probably something wrong if it fails to pass these tests.
>>> Considering the hardware differences, it's impossible to set an exact
>>> value for these events in advance, and there seems to be no better
>>> method to verify the PMC count either. I still prefer to keep these
>>> tests until we have a better method to verify the accuracy of the PMC
>>> count.
>> If it's impossible to set an exact value for these events in advance,
>> how does Intel validate the hardware PMU?
>
> I don't have much idea of how the HW team validates the PMU
> functionality. But per the information I've gotten, they could have
> some very tiny benchmarks with a fixed pattern and run them in a
> certain scenario, so they can expect a very accurate count value. But
> this is different from our case: a real program is executed on a real
> system (probably shared with other programs), and the event count is
> impacted by too many hardware/software factors, such as cache
> contention, so it's hard to predict a single accurate count in advance.
>

Yes, there are many factors that could impact the value of the
microbenchmarks. I don't think there is a universal benchmark for all
generations and all configurations.

Thanks,
Kan

> Anyway, it's only my guess about the ways of hardware validation;
> adding Kan to get more information.
>
> Hi Kan,
>
> Do you have more information about how the HW team validates the PMC
> count accuracy? Thanks.
>
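
For context, the kind of range-based check being debated might look
roughly like the minimal sketch below. It assumes kvm-unit-tests-style
wrmsr()/rdmsr() helpers; the MSR indices and bit positions follow the
Intel SDM, but the workload, the bounds, and the function name are
illustrative assumptions, not the actual patch:

#include <stdint.h>

/* Architectural PMU MSRs (Intel SDM); on Sapphire Rapids, fixed
 * counter 3 counts the topdown slots event. */
#define MSR_CORE_PERF_FIXED_CTR0     0x309
#define MSR_CORE_PERF_FIXED_CTR_CTRL 0x38d
#define MSR_CORE_PERF_GLOBAL_CTRL    0x38f

/* Assumed helpers, as provided by the kvm-unit-tests environment. */
extern void wrmsr(uint32_t index, uint64_t value);
extern uint64_t rdmsr(uint32_t index);

/* Illustrative bounds only: slots scale with issue width times
 * unhalted core cycles, so the test can merely assert that the count
 * falls inside a plausible window. */
#define LOOP_ITERATIONS   1000000ull
#define SLOTS_LOWER_BOUND (1ull * LOOP_ITERATIONS)
#define SLOTS_UPPER_BOUND (100ull * LOOP_ITERATIONS)

static int check_fixed_counter3_slots(void)
{
	uint64_t slots;

	/* Clear fixed counter 3 before counting. */
	wrmsr(MSR_CORE_PERF_FIXED_CTR0 + 3, 0);
	/* Enable fixed counter 3 for OS and USR (bits 12-13 = 0b11). */
	wrmsr(MSR_CORE_PERF_FIXED_CTR_CTRL, 0x3ull << (4 * 3));
	/* Set the global enable bit for fixed counter 3 (bit 35). */
	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 1ull << (32 + 3));

	/* Workload: a trivial loop; the exact number of slots it
	 * produces depends on the microarchitecture, hence the window. */
	for (volatile uint64_t i = 0; i < LOOP_ITERATIONS; i++)
		;

	wrmsr(MSR_CORE_PERF_GLOBAL_CTRL, 0);
	slots = rdmsr(MSR_CORE_PERF_FIXED_CTR0 + 3);

	return slots >= SLOTS_LOWER_BOUND && slots <= SLOTS_UPPER_BOUND;
}

Jim's objection applies directly to a check of this shape: because the
window is loose, a pass only shows that some counter incremented by a
plausible amount; it cannot distinguish whether KVM actually backed the
guest's fixed counter 3 or silently substituted a different counter.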