Hello colleagues,

Following up on Tim Bird's presentation "Adding benchmarks results support to KTAP/kselftest", I would like to share some thoughts on kernel benchmarking and kernel performance evaluation. Tim suggested sharing these comments with the wider kselftest community for discussion. The topic of performance evaluation is extremely complex, so I have organised my comments into sections, each focusing on a specific aspect: metrics, reference values, a results data lake, interpretation of contradictory results, system profiles, analysis and methodology.

# Metrics

A few remarks on benchmark metrics, which were called "values" in the original presentation:

- Metrics must be accompanied by standardised units. This standardisation ensures consistency across different tests and environments, simplifying accurate comparisons and analysis.
- Each metric should be clearly labelled with its nature or kind (throughput, speed, latency, etc.). This classification is essential for proper interpretation of the results and prevents misunderstandings that could lead to incorrect conclusions.
- The presentation says "May also include allowable variance", but variance must be included in the analysis itself, since we are dealing with statistical calculations over multiple randomised values.
- I would also like to note that other statistical parameters are worth including in the comparison, such as confidence levels, sample size and so on.

# Reference Values

The concept of "reference values" introduced in the slides could be significantly enhanced by implementing a collaborative, transparent system for data collection and validation. Such a system could operate as follows:

- Data Collection: Any user could submit benchmark results to a centralised, public repository. This would allow a diverse range of hardware configurations and use cases to be represented.
- Vendor Validation: Hardware vendors would have the opportunity to review submitted results pertaining to their products. They could then mark certain results as "Vendor Approved", indicating that the results align with their own testing and expectations.
- Community Review: The broader community of users and experts could also review and vote on submitted results. Results that receive substantial positive feedback could be marked as "Community Approved", providing an additional layer of validation.
- Automated Validation: Reference values must be checked, validated and supported by multiple sources. This can only be done automatically, as these processes are time consuming and require extreme attention to detail.
- Transparency: All submitted results would need to be accompanied by detailed information about the testing environment, hardware specifications and methodology used. This would ensure reproducibility and allow others to understand the context of each result.
- Trust Building: The combination of vendor and community approval would help establish trust in the reference values. It would mitigate concerns about marketing bias and provide a more reliable basis for performance comparisons.
- Accessibility: The system would be publicly accessible, allowing anyone to reference and use this data in their own testing and analysis.

Implementation of such a system would require careful consideration of governance and funding. A community-driven, non-profit organisation sponsored by multiple stakeholders could be an appropriate model. This structure would help maintain neutrality and avoid potential conflicts of interest. While the specifics of building and managing such a system would need further exploration, this approach could significantly improve the reliability and usefulness of reference values in benchmark testing. It would foster a more collaborative and transparent environment for performance evaluation in the Linux ecosystem and attract interested vendors to submit and review results. I am not very informed about the current state of the community in this field, but I am sure you know better how exactly this could be done.
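To make the points about metric metadata and reference records more concrete, here is a minimal sketch (in Python-dict form) of what a single submitted reference record could look like. The field names and layout are purely illustrative assumptions, not an existing or proposed schema:

---
# A minimal sketch of a single submitted reference record; all field
# names are hypothetical and used only for illustration.
reference_record = {
    "id": "550e8400-e29b-41d4-a716-446655440005",  # UUID4 key
    "benchmark": "Test Suite A",
    "kernel_version": "v6.12",
    "metrics": {
        "sequential_read": {
            "kind": "throughput",         # nature of the metric
            "unit": "MiB/s",              # standardised unit
            "mean": 512.4,
            "variance": 18.9,
            "sample_size": 30,
            "confidence_level": 0.95,
        },
    },
    "validation": {
        "vendor_approved": False,         # set after vendor review
        "community_votes": 0,             # community review counter
        "automated_checks_passed": True,  # cross-source validation
    },
    "environment": {
        "hardware": "...",                # detailed HW description
        "os_image": "...",                # base OS / cloud image
        "methodology": "...",             # how the run was performed
    },
}
---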
# Results Data Lake

Along with reference values, it is important to collect results on a regular basis: the kernel evolves, so the results must follow this evolution as well. To do this, a cloud-based data lake is needed (a self-hosted system would be too expensive, in my view). This data lake should be able to collect and process incoming data as well as serve reference values to users. The data processing flow would be quite standard: Collection -> Parsing + Enhancement -> Storage -> Analysis -> Serving.

Tim proposed to use file names for reference files. I would like to note that such an approach could break down quickly: as the system collects more and more data, the need for more granular and detailed features to identify reference results will lead to very long filenames, which will be hard to use. I propose UUID4-based identification instead, which has a negligible chance of collision. These IDs would be keys in a database holding all the information required to clearly identify the relevant results and their details. Moreover, this approach can easily be extended on the database side if more data is needed. Yes, a UUID4 is not human-readable, but do we need that if we have tools which provide a better interface? For example, this could look something like:

---
request:
results-cli search -b "Test Suite A" -v "v1.2.3" -o "Ubuntu 22.04" -t "baseline" -m "response_time>100"

response:
[
  {
    "id": "550e8400-e29b-41d4-a716-446655440005",
    "benchmark": "Test Suite A",
    "version": "v1.2.3",
    "target_os": "Ubuntu 22.04",
    "metrics": {
      "cpu_usage": 70.5,
      "memory_usage": 2048,
      "response_time": 120
    },
    "tags": ["baseline", "v1.0"],
    "created_at": "2024-10-25T10:00:00Z"
  },
  ...
]
---

or

---
request:
results-cli search "<Domain-Specific-Language-Query>"

response:
[ {}, {}, {}... ]
---

or

---
request:
results-cli get 550e8400-e29b-41d4-a716-446655440005

response:
{
  "id": "550e8400-e29b-41d4-a716-446655440005",
  "benchmark": "Test Suite A",
  "version": "v1.2.3",
  "target_os": "Ubuntu 22.04",
  "metrics": {
    "cpu_usage": 70.5,
    "memory_usage": 2048,
    "response_time": 120
  },
  "tags": ["baseline", "v1.0"],
  "created_at": "2024-10-25T10:00:00Z"
}
---

or

---
request:
curl -X POST http://api.example.com/references/search \
  -d '{ "query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'
...
---

Another advantage of the DB-based approach: a user who works with particular hardware and/or would like to use a reference does not need the full database of collected reference values, only a small slice of it. This slice can be downloaded from a public repository or accessed via an API.
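As an illustration of the UUID4-keyed storage and "slice" idea, here is a minimal sketch in Python, assuming a local SQLite copy of such a slice; the table layout and field names are hypothetical:

---
import json
import sqlite3
import uuid

# Minimal sketch of a UUID4-keyed results store; the schema is illustrative.
db = sqlite3.connect("results-slice.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS results (
        id         TEXT PRIMARY KEY,   -- UUID4
        benchmark  TEXT,
        version    TEXT,
        target_os  TEXT,
        payload    TEXT                -- full record as JSON
    )
""")

def store(record: dict) -> str:
    """Assign a UUID4 key and persist the record."""
    record_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
        (record_id, record["benchmark"], record["version"],
         record["target_os"], json.dumps(record)),
    )
    db.commit()
    return record_id

def slice_for(benchmark: str, version: str) -> list[dict]:
    """Return only the small slice of reference data a user actually needs."""
    rows = db.execute(
        "SELECT payload FROM results WHERE benchmark = ? AND version = ?",
        (benchmark, version),
    )
    return [json.loads(payload) for (payload,) in rows]
---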
# Large results dataset

If we collect a large benchmark dataset in one place, accompanied by detailed information about the target systems it was collected from, it will allow us to calculate precise baselines across different combinations of parameters, making performance deviations easier to detect. Long-term trend analysis can identify small changes and correlate them with updates, revealing performance drift. Another use of such a database is predictive modelling, which can forecast expected results and set dynamic performance thresholds, enabling early issue detection. Anomaly detection also becomes more effective with context, distinguishing unusual deviations from normal behaviour.

# Interpretation of contradictory results

It is not clear how to deal with contradictory results when deciding whether a regression is present. For example, suppose we have a set of 10 tests which measure more or less the same thing, say disk performance. It is unclear what to do when one subset of tests shows degradation and another subset shows neutral status or improvements. Is there a regression? I suppose that the availability of historical data can help in such situations, as it can show the behaviour of particular tests and allow weights to be assigned in decision-making algorithms, but that is just my guess.

# System Profiles

Tim's idea of reducing results to "pass / fail", and my experience with various people trying to interpret benchmarking results, led me to the idea of "profiles": a set of parameters and metrics collected from a reference system while executing a particular configuration of a particular benchmark. Profiles can be used for A/B comparison with pass/fail (or match/no match) outcomes; a minimal comparison sketch is shown after the list below. This approach does not hide or lose the details and allows us to capture multiple characteristics of the experiment, such as the presence of outliers/errors or a skewed distribution. Interested parties (kernel developers or performance engineers, for example) can dig deeper to find the reason for a mismatch, while for those who only care about high-level results, pass/fail should be enough. Here is how I imagine the structure of a profile:

---
profile_a:
  system:
    packages:
      - pkg_1
      - pkg_2
      # Additional packages...
    settings:
      - cmdline
      # Additional settings...
    indicators:
      cpu: null
      ram: null
      loadavg: null
      # Additional indicators...
  benchmark:
    settings:
      param_1: null
      param_2: null
      param_x: null
    metrics:
      metric_1: null
      metric_2: null
      metric_x: null
---

- System Packages, System Settings: We usually do not pay much attention to this, but I think it is worth highlighting that the base OS is an important factor, as distribution-specific modifications are present in the filesystem. Most commonly developers and researchers use Ubuntu (as the most popular distro) or Debian (as the cleaner, more lightweight base that Ubuntu derives from), but distributions apply their own patches to the kernel and system libraries, which may impact performance. Another kind of base OS is cloud OS images, which can be modified by cloud providers to add internal packages and services that could also affect performance. We must take this aspect into account so that we compare apples to apples.
- System Indicators: Periodic statistics such as CPU utilisation, RAM consumption and other parameters, collected before, during and after benchmarking.
- Benchmark Settings: Benchmarking systems have multiple parameters, so it is important to capture them and use them in the analysis.
- Benchmark Metrics: The benchmark results themselves. It is not uncommon for a benchmark to produce more than a single number.
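To illustrate the profile-based A/B comparison mentioned above, here is a minimal sketch in Python. The tolerance value, field names and the flat metric layout are assumptions for illustration only, not a proposed implementation:

---
# Minimal sketch of an A/B profile comparison; thresholds and field
# names are illustrative assumptions.
def compare_profiles(reference: dict, candidate: dict,
                     tolerance: float = 0.05) -> dict:
    """Compare candidate metrics against a reference profile.

    Returns an overall pass/fail verdict plus per-metric details, so the
    high-level result does not hide the underlying data.
    """
    details = {}
    for name, ref_value in reference["benchmark"]["metrics"].items():
        cand_value = candidate["benchmark"]["metrics"].get(name)
        if cand_value is None or ref_value in (None, 0):
            details[name] = {"status": "missing-data"}
            continue
        deviation = (cand_value - ref_value) / ref_value
        details[name] = {
            "reference": ref_value,
            "candidate": cand_value,
            "deviation": round(deviation, 4),
            "status": "match" if abs(deviation) <= tolerance else "mismatch",
        }
    verdict = "pass" if all(
        d.get("status") == "match" for d in details.values()
    ) else "fail"
    return {"verdict": verdict, "details": details}
---

A CI pipeline would only need the "verdict" field, while a kernel developer or performance engineer could inspect "details" to see which metrics diverged and by how much.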
# Analysis

The proposed rules-based analysis will only work for highly deterministic, well-defined environments and systems, where rules can describe all the relevant aspects. Rule-based systems are easier to understand and implement than other approaches, but only while the rule set stays small. We are dealing with a live, constantly evolving system, so rules will become outdated extremely fast. It is the same story as with rule-based recommender systems in the early years of machine learning. If you do want to follow a rules-based approach, it is probably worth taking a look at https://www.clipsrules.net, as this would decouple the results from the analysis and avoid reinventing an analysis engine. Declaring those rules will still be error-prone because of their origin: they must be written and maintained by humans. IMHO, a human-less approach using modern ML methods would be more beneficial in the long run.

# Methodology

The methodology used is another aspect not directly related to Tim's slides, but it is an important topic for results processing and interpretation. An automated results-interpretation system would probably push users towards one methodology or another.

# Next steps

I would be glad to participate in further discussions and to share experience to improve kernel performance testing automation, analysis and interpretation of results. If there is interest, I am open to collaborating on implementing some of these ideas.

--
Best regards,
Konstantin Belov