Re: [PATCH 5/5] selftests/resctrl: Reduce failures due to outliers in MBA/MBM tests

Reinette Chatre <reinette.chatre@xxxxxxxxx> · Wed, 13 Sep 2023 14:00:19 -0700

Hi Ilpo,

On 9/13/2023 4:43 AM, Ilpo Järvinen wrote:
> On Tue, 12 Sep 2023, Reinette Chatre wrote:
>> On 9/11/2023 4:19 AM, Ilpo Järvinen wrote:
>>> 5% difference upper bound for success is a bit on the low side for the
>>
>> "a bit on the low side" is very vague.
> 
> The commit that introduced that 5% bound plainly admitted it's "randomly 
> chosen value". At least that wasn't vague, I guess. :-)
> 
> So what I'm trying to do here is to have "randomly chosen value" replaced 
> with a value that seems to work well enough based on measurements on 
> a large set of platforms.

That is already a better motivation for this change. Your cover letter
hints at this, but the description does not mention that the new value
is not just another number pulled from the air: it is based on
significant testing on a large variety of systems. The description can
surely mention the work that led you to propose this new number, no?

> 
> Personally, I don't care much about this, I can just ignore the failures 
> due to outliers (and also reports about failing MBA/MBM test if somebody 
> ever sends one to me), but if I'd be one running automated tests it would 
> be annoying to have a problem like this unaddressed.

In no way was I suggesting that this should not be addressed.

> 
>>> MBA and MBM tests. Some platforms produce outliers that are slightly
>>> above that, typically 6-7%.
>>>
>>> Relaxing the MBA/MBM success bound to 8% removes most of the failures
>>> due to those frequent outliers.
>>
>> This description needs more context on what issue is being solved here.
>> What does the % difference represent? How was new percentage determined?
>>
>> Did you investigate why there are differences between platforms? From
>> what I understand these tests measure memory bandwidth using perf and
>> resctrl and then compare the difference. Are there interesting things 
>> about the platforms on which the difference is higher than 5%?
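
To make the question about the % difference concrete, here is a minimal
sketch of the kind of check I understand the tests to perform. The
helper names, the averaging over n samples, and the integer truncation
are my assumptions for illustration, not code copied from the selftest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: average the per-iteration bandwidth measured via
 * perf (iMC counters) and via resctrl, then return the difference
 * relative to the perf average as a whole percentage (truncated).
 */
static int avg_diff_percent(const unsigned long *bw_imc,
			    const unsigned long *bw_resctrl, size_t n)
{
	unsigned long sum_imc = 0, sum_resctrl = 0;
	long avg_imc, avg_resctrl;
	double diff;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		sum_imc += bw_imc[i];
		sum_resctrl += bw_resctrl[i];
	}

	avg_imc = (long)(sum_imc / n);
	avg_resctrl = (long)(sum_resctrl / n);
	assert(avg_imc > 0);

	diff = (double)labs(avg_resctrl - avg_imc) / (double)avg_imc;
	return (int)(diff * 100.0);
}

/* Hypothetical success criterion: pass if the difference is within the bound. */
static bool within_bound(int diff_percent, int max_percent)
{
	return diff_percent <= max_percent;
}
```

With a check of this shape, an outlier of 6-7% fails against a 5% bound
and passes against an 8% bound, which is the behaviour the changelog
should spell out.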
> 
> Not really I think. The number just isn't that stable to always remain 
> below 5% (even if it usually does).
> 
> The only systematic thing I've come across is that if I play with the 
> read pattern for defeating the hw prefetcher (you've seen a patch earlier 
> and it will be among the series I'll send after this one), it has an 
> impact which looks more systematic across all MBM/MBA tests. But it's not 
> what I'm trying to address now with this patch.
> 
>> Could
>> those be systems with multiple sockets (and thus multiple PMUs that need
>> to be set up, reset, and read)? Can the reading of the counters be improved
>> instead of relaxing the success criteria? A quick comparison between
>> get_mem_bw_imc() and get_mem_bw_resctrl() makes me think that a difference
>> is not surprising ... note how the PMU counters are started and reset
>> (potentially on multiple sockets) at every iteration while the resctrl
>> counters keep rolling and new values are just subtracted from previous.
> 
> Perhaps, I can try to look into it (add to my todo list so I won't 
> forget). But in the meantime, this new value is picked using a criterion 
> that looks better than "randomly chosen value". If I ever manage to 
> address the outliers, the bound could be lowered again.
> 
> I'll update the changelog to explain things better.
> 
> 
ok, thank you.

Reinette