> -----Original Message-----
> From: Konstantin Belov <konstantin.belov@xxxxxxxxxx>
>
> Hello colleagues,
>
> Following up on Tim Bird's presentation "Adding benchmarks results
> support to KTAP/kselftest", I would like to share some thoughts on
> kernel benchmarking and kernel performance evaluation. Tim suggested
> sharing these comments with the wider kselftest community for
> discussion.
>
> The topic of performance evaluation is obviously extremely complex, so
> I’ve organised my comments into several paragraphs, each of which
> focuses on a specific aspect. This should make it easier to follow and
> understand the key points, such as metrics, reference values, the
> results data lake, interpretation of contradictory results, system
> profiles, analysis and methodology.
>
> # Metrics
> A few remarks on benchmark metrics, which were called “values” in the
> original presentation:
>
> - Metrics must be accompanied by standardised units. This
> standardisation ensures consistency across different tests and
> environments, simplifying accurate comparisons and analysis.

Agreed. I've used the term "metrics" in the past, but the word is
somewhat overloaded, so I avoided it in my recent presentation. My
proposed system includes units, and the possibility of conversion
between units when needed (see the sketch after this list).

> - Each metric should be clearly labelled with its nature or kind
> (throughput, speed, latency, etc). This classification is essential
> for proper interpretation of the results and prevents
> misunderstandings that could lead to incorrect conclusions.

I'm not sure I agree on this.

> - The presentation contains "May also include allowable variance", but
> variance must be included in the analysis, as we are dealing with
> statistical calculations and multiple randomised values.

In my tool, including variance along with a reference value is
optional. For some types of value thresholds I don't think the
threshold necessarily requires a variance.

> - I would like to note that other statistical parameters are also
> worth including in the comparison, such as confidence levels,
> sample size and so on.
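To make the units point a bit more concrete, here is a minimal sketch
of what a unit-aware value record and a conversion step could look
like. It is purely illustrative: the field names, the conversion table
and the example metric are made up, not part of any existing
KTAP/kselftest schema.
---
# Illustrative only: field names, units and conversion factors are
# hypothetical, not an existing kselftest/KTAP schema.
from dataclasses import dataclass
from typing import Optional

# Conversion factors from a reported unit to a canonical unit.
UNIT_TO_CANONICAL = {
    "KB/s": ("B/s", 1024),
    "MB/s": ("B/s", 1024 * 1024),
    "us":   ("ns", 1000),
    "ms":   ("ns", 1000 * 1000),
}

@dataclass
class Value:
    name: str                          # e.g. "seq_read_throughput"
    value: float
    unit: str                          # e.g. "MB/s"
    variance: Optional[float] = None   # optional, as discussed above

def canonical(v: Value) -> Value:
    """Convert a value to its canonical unit so that results reported
    in different units can be compared directly."""
    if v.unit not in UNIT_TO_CANONICAL:
        return v
    unit, factor = UNIT_TO_CANONICAL[v.unit]
    var = v.variance * factor if v.variance is not None else None
    return Value(v.name, v.value * factor, unit, var)

print(canonical(Value("seq_read_throughput", 512.0, "MB/s")))
---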
> # Reference Values
> The concept of "reference values" introduced in the slides could be
> significantly enhanced by implementing a collaborative, transparent
> system for data collection and validation. This system could operate
> as follows:
>
> - Data Collection: Any user could submit benchmark results to a
> centralised and public repository. This would allow for a diverse
> range of hardware configurations and use cases to be represented.

I agree there should be a centralized repository for reference values.
It will be easier to create reference values, IMHO, if there is also a
repository of value results as well.

> - Vendor Validation: Hardware vendors would have the opportunity to
> review submitted results pertaining to their products. They could then
> mark certain results as "Vendor Approved," indicating that the results
> align with their own testing and expectations.

It would be nice if vendors provided reference values along with their
distributions of Linux.

> - Community Review: The broader community of users and experts could
> also review and vote on submitted results. Results that receive
> substantial positive feedback could be marked as "Community Approved,"
> providing an additional layer of validation.

This sounds a little formal to me.

> - Automated Validation: Reference values must be checked, validated
> and supported by multiple sources. This can only be done in an
> automated way, as those processes are time consuming and require
> extreme attention to detail.
>
> - Transparency: All submitted results would need to be accompanied by
> detailed information about the testing environment, hardware
> specifications, and methodology used. This would ensure
> reproducibility and allow others to understand the context of each
> result.

Indeed, there will need to be a lot of meta-data associated with
reference values, in order to make sure that the correct set of
reference values is used in testing specific machines, environments,
and kernel versions.

> - Trust Building: The combination of vendor and community approval
> would help establish trust in the reference values. It would mitigate
> concerns about marketing bias and provide a more reliable basis for
> performance comparisons.
>
> - Accessibility: The system would be publicly accessible, allowing
> anyone to reference and utilise this data in their own testing and
> analysis.

Yes and yes.

> Implementation of such a system would require careful consideration of
> governance and funding. A community-driven, non-profit organisation
> sponsored by multiple stakeholders could be an appropriate model. This
> structure would help maintain neutrality and avoid potential conflicts
> of interest.

KernelCI seems like a good project to host such a repository. Possibly
it could be KCIdb, which has just added values to its database schema.

> While the specifics of building and managing such a system would need
> further exploration, this approach could significantly improve the
> reliability and usefulness of reference values in benchmark testing.
> It would foster a more collaborative and transparent environment for
> performance evaluation in the Linux ecosystem, as well as attract
> interested vendors to submit and review results.
>
> I’m not very informed about the current state of the community in this
> field, but I’m sure you know better how exactly this can be done.
>
> # Results Data Lake
> Along with reference values, it’s important to collect results on a
> regular basis: as the kernel evolves, the results must follow that
> evolution as well. To do this, a cloud-based data lake is needed (a
> self-hosted system would be too expensive, from my point of view).
>
> This data lake should be able to collect and process incoming data as
> well as serve reference values to users. The data processing flow
> would be quite standard: Collection -> Parsing + Enhancement ->
> Storage -> Analysis -> Serving.
>
> Tim proposed using file names for reference files. I would like to
> note that such an approach could fail pretty fast: as the system
> collects more and more data, there will be a need for more granular
> and detailed features to identify reference results, which can lead
> to very long filenames that are hard to use.

No argument there. I used filenames for my proof of concept, but
clearly a different meta-data matching system that utilizes more
variables will need to be used. I'm currently working on gathering
data for a boot-time test to inform the set of meta-data that applies
to boot-time performance data.

> I propose using UUID4-based identification, which has a very low
> chance of collision. Those IDs would be keys in a database holding
> all the information required to clearly identify the relevant results
> and corresponding details. Moreover, this approach can easily be
> extended on the database side if more data is needed.

I'm not sure how this solves the matching problem. To determine test
outcomes, you need to use reference data that is most similar to your
machine. Having a shared repository of reference values will be useful
for testers with limited experience, who are working on common
hardware. However, I envision that many testers will use previous
results from their own board as reference values for tests (after
validating the numbers against their requirements).
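To illustrate how UUID keys and meta-data matching could fit together,
here is a rough sketch. The record fields, example board names and the
naive "count matching fields" score are all hypothetical; a real
system would need a much richer meta-data model.
---
# Illustrative only: fields, example data and the matching heuristic
# are hypothetical, not an existing schema or tool.
import uuid

# Reference results keyed by UUID4; the meta-data lives in the record.
reference_store = {
    str(uuid.uuid4()): {
        "benchmark": "fio-seq-read",
        "board": "beaglebone-black",
        "kernel": "6.6",
        "metrics": {"throughput_MBps": 38.0},
    },
    str(uuid.uuid4()): {
        "benchmark": "fio-seq-read",
        "board": "raspberrypi4",
        "kernel": "6.6",
        "metrics": {"throughput_MBps": 41.5},
    },
}

def best_reference(store, benchmark, board, kernel):
    """Pick the stored reference whose meta-data most closely matches
    the machine under test (naive score: count of matching fields)."""
    def score(rec):
        return sum((rec["benchmark"] == benchmark,
                    rec["board"] == board,
                    rec["kernel"] == kernel))
    uid, rec = max(store.items(), key=lambda kv: score(kv[1]))
    return uid, rec

uid, rec = best_reference(reference_store, "fio-seq-read",
                          "raspberrypi4", "6.6")
print(uid, rec["metrics"])
---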
> Yes, UUID4 is not human-readable, but do we need such an option if we
> have tools that can provide a better interface?
>
> For example, this could be something like:
> ---
> request: results-cli search -b "Test Suite A" -v "v1.2.3" -o "Ubuntu 22.04" -t "baseline" -m "response_time>100"
> response:
> [
>   {
>     "id": "550e8400-e29b-41d4-a716-446655440005",
>     "benchmark": "Test Suite A",
>     "version": "v1.2.3",
>     "target_os": "Ubuntu 22.04",
>     "metrics": {
>       "cpu_usage": 70.5,
>       "memory_usage": 2048,
>       "response_time": 120
>     },
>     "tags": ["baseline", "v1.0"],
>     "created_at": "2024-10-25T10:00:00Z"
>   },
>   ...
> ]
> ---
> or
> request: results-cli search "<Domain-Specific-Language-Query>"
> response: [ {}, {}, {}, ...]
> ---
> or
> request: results-cli get 550e8400-e29b-41d4-a716-446655440005
> response:
> {
>   "id": "550e8400-e29b-41d4-a716-446655440005",
>   "benchmark": "Test Suite A",
>   "version": "v1.2.3",
>   "target_os": "Ubuntu 22.04",
>   "metrics": {
>     "cpu_usage": 70.5,
>     "memory_usage": 2048,
>     "response_time": 120
>   },
>   "tags": ["baseline", "v1.0"],
>   "created_at": "2024-10-25T10:00:00Z"
> }
> ---
> or
> request: curl -X POST http://api.example.com/references/search \
>   -d '{ "query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'
> ...
> ---

It sounds like you have a database implementation already in mind.

> Another argument for the DB-based approach: a user who works with
> particular hardware and/or would like to use a reference does not
> need the full database of all collected reference values, only a
> small slice of it. This slice could be downloaded from a public repo
> or accessed via an API.
>
> # Large results dataset
> If we collect a large benchmark dataset in one place, accompanied by
> detailed information about the target systems it was collected from,
> it will allow us to calculate precise baselines across different
> combinations of parameters, making performance deviations easier to
> detect. Long-term trend analysis can identify small changes and
> correlate them with updates, revealing performance drift.
>
> Another use of such a database is predictive modelling, which can
> provide forecasts of expected results and set dynamic performance
> thresholds, enabling early issue detection. Anomaly detection also
> becomes more effective with context, distinguishing unusual
> deviations from normal behaviour.
>
> # Interpretation of contradictory results
> It’s not clear how to deal with contradictory results when deciding
> whether a regression is present. For example, we have a set of 10
> tests which test more or less the same thing, for example disk
> performance. It’s unclear what to do when one subset of tests shows
> degradation and another subset shows neutral status or improvements.
> Is there a regression?

If a test shows a regression, then either there has been a regression
and the test is accurate, or there has not been a regression and the
test or reference values need fixing.

> I suppose that the availability of historical data can help to deal
> with such situations, as historical data can show the behaviour of
> particular tests and allow weights to be assigned in decision-making
> algorithms, but it’s just my guess.
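Just to make the weighting idea concrete, here is a toy sketch. The
per-test verdicts, the weights and the decision threshold are entirely
made up; in practice they would have to come from historical data and
an agreed policy.
---
# Purely illustrative: verdicts, weights and threshold are made up.

# Per-test verdicts from related disk-performance tests:
# -1 = regression, 0 = neutral, +1 = improvement.
outcomes = {"fio_seq_read": -1, "fio_rand_read": -1, "fio_seq_write": 0,
            "dbench": 0, "iozone": +1}

# Hypothetical weights, e.g. derived from how rarely each test has
# produced false alarms in historical runs.
weights = {"fio_seq_read": 0.9, "fio_rand_read": 0.8, "fio_seq_write": 0.5,
            "dbench": 0.4, "iozone": 0.3}

score = sum(outcomes[t] * weights[t] for t in outcomes)
verdict = "regression" if score < -0.5 else "no regression"
print(f"weighted score = {score:.2f} -> {verdict}")
---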
> # System Profiles
> Tim's idea to reduce results to “pass / fail”, and my experience with
> various people trying to interpret benchmarking results, led me to
> think of “profiles” - a set of parameters and metrics collected from a
> reference system while executing a particular configuration of a
> particular benchmark.

Numbers alone have no meaning. They have to be converted into some
signal indicating that action is needed. At Plumbers one developer
indicated that more than just a binary state would be desirable. The
testcase outcome is intended to indicate that some issue needs
addressing, and there will always be a threshold needed.

> Profiles can be used for A/B comparison with pass/fail (or match/no
> match) outcomes. This approach does not hide or miss the details and
> allows capturing multiple characteristics of the experiment, such as
> the presence of outliers/errors or a skewed distribution. Interested
> parties (kernel developers or performance engineers, for example) can
> dig deeper to find the reason for such a mismatch, while for those
> who are interested only in high-level results, pass/fail should be
> enough.

Of course results should be available to allow diagnosing the problem.
But for a CI system to indicate that action is needed, there must be
some numeric comparison (whether that represents something more
expansive like a curve shape or an aggregation of data points). A
rough sketch of such a comparison follows the profile description
below.

> Here is how I imagine the structure of a profile:
> ---
> profile_a:
>   system packages:
>     - pkg_1
>     - pkg_2
>     # Additional packages...
>
>   settings:
>     - cmdline
>     # Additional settings...
>
>   indicators:
>     cpu: null
>     ram: null
>     loadavg: null
>     # Additional indicators...
>
>   benchmark:
>     settings:
>       param_1: null
>       param_2: null
>       param_x: null
>
>     metrics:
>       metric_1: null
>       metric_2: null
>       metric_x: null
> ---
>
> - System Packages, System Settings: Usually we do not pay much
> attention to this, but I think it’s worth highlighting that the base
> OS is an important factor, as there are distribution-specific
> modifications present in the filesystem. Most commonly, developers
> and researchers use Ubuntu (as the most popular distro) or Debian (as
> a cleaner and more lightweight version of Ubuntu), but distributions
> apply their own patches to the kernel and system libraries, which may
> impact performance. Another kind of base OS is cloud OS images, which
> can be modified by cloud providers to add internal packages and
> services that could potentially affect performance as well. When
> comparing, we must take this aspect into account to compare
> apples-to-apples.
> - System Indicators: These are periodic statistics like CPU
> utilisation, RAM consumption and other parameters, collected before,
> during and after benchmarking.
> - Benchmark Settings: Benchmarking systems have multiple parameters,
> so it’s important to capture them and use them in the analysis.
> - Benchmark Metrics: These are, obviously, the benchmark results. It
> is not rare for a benchmark test to provide more than a single
> number.
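As a minimal sketch of the kind of numeric comparison a CI pass/fail
outcome ultimately needs, the snippet below compares measured metrics
against reference values with a per-metric direction of "good". The
metric names, reference numbers and the 5% margin are hypothetical.
---
# Illustrative only: metric names, reference values and the margin are
# made up; this just shows a direction-aware pass/fail comparison.

# Direction of "good": higher throughput is better, lower latency is
# better.
HIGHER_IS_BETTER = {"seq_read_MBps": True, "p99_latency_ms": False}

reference = {"seq_read_MBps": 120.0, "p99_latency_ms": 8.5}
measured  = {"seq_read_MBps": 112.0, "p99_latency_ms": 9.8}
MARGIN = 0.05   # allow 5% deviation before declaring a regression

def outcome(metric, value, ref):
    """Return "pass" or "fail" depending on the metric's direction."""
    if HIGHER_IS_BETTER[metric]:
        return "pass" if value >= ref * (1 - MARGIN) else "fail"
    return "pass" if value <= ref * (1 + MARGIN) else "fail"

for m in reference:
    print(m, outcome(m, measured[m], reference[m]))
---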
> # Analysis
> The proposed rules-based analysis will work only for highly
> deterministic environments and systems, where rules can describe all
> the aspects. Rule-based systems are easier to understand and
> implement than other types, but only for a small set of rules.
> However, we are dealing with a live system that constantly evolves,
> so rules will become outdated extremely fast. It's the same story as
> with rule-based recommender systems in the early years of machine
> learning.
>
> If you want to follow a rules-based approach, it's probably worth
> taking a look at https://www.clipsrules.net as this would allow
> decoupling results from analysis and avoid reinventing the analysis
> engine.
>
> Declaring those rules will be error-prone due to the nature of their
> origin - they must be declared and maintained by humans. IMHO a
> human-less approach using modern ML methods instead would be more
> beneficial in the long run.

Well, the rules I was proposing (criteria rules) were about the
conversion from numbers to testcase outcomes, and some way to express
the direction of operation for numeric comparisons. Not much more than
that. There was no intelligence intended beyond simple numeric
comparisons of result values with reference values. I'd prefer not to
overcomplicate the proposal with analysis of data.

It's quite possible that the rules should be baked into the test,
rather than having them separate like I proposed. Most tests would
have rather obvious comparison directions, so having the rules be
separate may be an unnecessary abstraction. (ie, throughput less than
a reference value is a regression, or latency more than a reference
value is a regression.)

> # Methodology
> The methodology used is another aspect that is not directly related
> to Tim's slides, but it's an important topic for results processing
> and interpretation. Perhaps the idea of automated results
> interpretation could encourage the use of one methodology or another.
>
> # Next steps
> I would be glad to participate in further discussions and share
> experience to improve kernel performance testing automation, analysis
> and interpretation of results. If there is interest, I'm open to
> collaborating on implementing some of these ideas.

Sounds good. I'm tied up with some boot-time initiatives at the
moment, but I hope to work on my proposal some more, and find a home
(ie a public repository) for some kselftest-related reference values,
before the end of the year.

I plan to submit the actual code for my proposal, along with some
tests that utilize it, to the list, and we can have more discussion
when that happens.

Thanks for sharing your ideas.
 -- Tim