Hello colleagues,

Following up on Tim Bird's presentation "Adding benchmarks results support to KTAP/kselftest", I would like to share some thoughts on kernel benchmarking and kernel performance evaluation. Tim suggested sharing these comments with the wider kselftest community for discussion. The topic of performance evaluation is extremely complex, so I have organised my comments into sections, each focusing on a specific aspect: metrics, reference values, a results data lake, interpretation of contradictory results, system profiles, analysis and methodology.

# Metrics

A few remarks on benchmark metrics, which were called "values" in the original presentation:

- Metrics must be accompanied by standardised units. This standardisation ensures consistency across different tests and environments, simplifying accurate comparisons and analysis.
- Each metric should be clearly labelled with its nature or kind (throughput, speed, latency, etc.). This classification is essential for proper interpretation of the results and prevents misunderstandings that could lead to incorrect conclusions.
- The presentation says "May also include allowable variance", but variance must be included in the analysis itself, since we are dealing with statistical calculations over multiple randomised values.
- I would also like to note that other statistical parameters are worth including in the comparison, such as confidence levels, sample size and so on.

# Reference Values

The concept of "reference values" introduced in the slides could be significantly enhanced by implementing a collaborative, transparent system for data collection and validation. Such a system could operate as follows:

- Data Collection: Any user could submit benchmark results to a centralised, public repository. This would allow a diverse range of hardware configurations and use cases to be represented.
- Vendor Validation: Hardware vendors would have the opportunity to review submitted results pertaining to their products. They could then mark certain results as "Vendor Approved", indicating that the results align with their own testing and expectations.
- Community Review: The broader community of users and experts could also review and vote on submitted results. Results that receive substantial positive feedback could be marked as "Community Approved", providing an additional layer of validation.
- Automated Validation: Reference values must be checked, validated and supported by multiple sources. This can only be done automatically, as these processes are time consuming and require extreme attention to detail.
- Transparency: All submitted results would need to be accompanied by detailed information about the testing environment, hardware specifications and methodology used. This would ensure reproducibility and allow others to understand the context of each result.
- Trust Building: The combination of vendor and community approval would help establish trust in the reference values. It would mitigate concerns about marketing bias and provide a more reliable basis for performance comparisons.
- Accessibility: The system would be publicly accessible, allowing anyone to reference and use this data in their own testing and analysis.

Implementation of such a system would require careful consideration of governance and funding. A community-driven, non-profit organisation sponsored by multiple stakeholders could be an appropriate model. This structure would help maintain neutrality and avoid potential conflicts of interest. While the specifics of building and managing such a system would need further exploration, this approach could significantly improve the reliability and usefulness of reference values in benchmark testing. It would foster a more collaborative and transparent environment for performance evaluation in the Linux ecosystem and attract interested vendors to submit and review results. I am not very informed about the current state of the community in this field, but I am sure you know better how exactly this could be done.
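To make the points about metric metadata and reference records more concrete, here is a minimal sketch (in Python-dict form) of what a single submitted reference record could look like. The field names and layout are purely illustrative assumptions, not an existing or proposed schema:

---
# A minimal sketch of a single submitted reference record; all field
# names are hypothetical and used only for illustration.
reference_record = {
    "id": "550e8400-e29b-41d4-a716-446655440005",  # UUID4 key
    "benchmark": "Test Suite A",
    "kernel_version": "v6.12",
    "metrics": {
        "sequential_read": {
            "kind": "throughput",         # nature of the metric
            "unit": "MiB/s",              # standardised unit
            "mean": 512.4,
            "variance": 18.9,
            "sample_size": 30,
            "confidence_level": 0.95,
        },
    },
    "validation": {
        "vendor_approved": False,         # set after vendor review
        "community_votes": 0,             # community review counter
        "automated_checks_passed": True,  # cross-source validation
    },
    "environment": {
        "hardware": "...",                # detailed HW description
        "os_image": "...",                # base OS / cloud image
        "methodology": "...",             # how the run was performed
    },
}
---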
# Results Data Lake

Along with reference values, it is important to collect results on a regular basis: the kernel evolves, so the results must follow this evolution as well. To do this, a cloud-based data lake is needed (a self-hosted system would be too expensive, in my view). This data lake should be able to collect and process incoming data as well as serve reference values to users. The data processing flow would be quite standard: Collection -> Parsing + Enhancement -> Storage -> Analysis -> Serving.

Tim proposed to use file names for reference files. I would like to note that such an approach could break down quickly: as the system collects more and more data, the need for more granular and detailed features to identify reference results will lead to very long filenames, which will be hard to use. I propose UUID4-based identification instead, which has a negligible chance of collision. These IDs would be keys in a database holding all the information required to clearly identify the relevant results and their details. Moreover, this approach can easily be extended on the database side if more data is needed. Yes, a UUID4 is not human-readable, but do we need that if we have tools which provide a better interface? For example, this could look something like:

---
request:
results-cli search -b "Test Suite A" -v "v1.2.3" -o "Ubuntu 22.04" -t "baseline" -m "response_time>100"

response:
[
  {
    "id": "550e8400-e29b-41d4-a716-446655440005",
    "benchmark": "Test Suite A",
    "version": "v1.2.3",
    "target_os": "Ubuntu 22.04",
    "metrics": {
      "cpu_usage": 70.5,
      "memory_usage": 2048,
      "response_time": 120
    },
    "tags": ["baseline", "v1.0"],
    "created_at": "2024-10-25T10:00:00Z"
  },
  ...
]
---

or

---
request:
results-cli search "<Domain-Specific-Language-Query>"

response:
[ {}, {}, {}... ]
---

or

---
request:
results-cli get 550e8400-e29b-41d4-a716-446655440005

response:
{
  "id": "550e8400-e29b-41d4-a716-446655440005",
  "benchmark": "Test Suite A",
  "version": "v1.2.3",
  "target_os": "Ubuntu 22.04",
  "metrics": {
    "cpu_usage": 70.5,
    "memory_usage": 2048,
    "response_time": 120
  },
  "tags": ["baseline", "v1.0"],
  "created_at": "2024-10-25T10:00:00Z"
}
---

or

---
request:
curl -X POST http://api.example.com/references/search \
  -d '{ "query": "benchmark = \"Test Suite A\" AND (version >= \"v1.2\" OR tag IN [\"baseline\", \"regression\"]) AND cpu_usage > 60" }'
...
---

Another advantage of the DB-based approach: a user who works with particular hardware and/or would like to use a reference does not need the full database of collected reference values, only a small slice of it. This slice can be downloaded from a public repository or accessed via an API.
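As an illustration of the UUID4-keyed storage and "slice" idea, here is a minimal sketch in Python, assuming a local SQLite copy of such a slice; the table layout and field names are hypothetical:

---
import json
import sqlite3
import uuid

# Minimal sketch of a UUID4-keyed results store; the schema is illustrative.
db = sqlite3.connect("results-slice.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS results (
        id         TEXT PRIMARY KEY,   -- UUID4
        benchmark  TEXT,
        version    TEXT,
        target_os  TEXT,
        payload    TEXT                -- full record as JSON
    )
""")

def store(record: dict) -> str:
    """Assign a UUID4 key and persist the record."""
    record_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
        (record_id, record["benchmark"], record["version"],
         record["target_os"], json.dumps(record)),
    )
    db.commit()
    return record_id

def slice_for(benchmark: str, version: str) -> list[dict]:
    """Return only the small slice of reference data a user actually needs."""
    rows = db.execute(
        "SELECT payload FROM results WHERE benchmark = ? AND version = ?",
        (benchmark, version),
    )
    return [json.loads(payload) for (payload,) in rows]
---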
# Large results dataset

If we collect a large benchmark dataset in one place, accompanied by detailed information about the target systems it was collected from, it will allow us to calculate precise baselines across different combinations of parameters, making performance deviations easier to detect. Long-term trend analysis can identify small changes and correlate them with updates, revealing performance drift. Another use of such a database is predictive modelling, which can forecast expected results and set dynamic performance thresholds, enabling early issue detection. Anomaly detection also becomes more effective with context, distinguishing unusual deviations from normal behaviour.

# Interpretation of contradictory results

It is not clear how to deal with contradictory results when deciding whether a regression is present. For example, suppose we have a set of 10 tests which measure more or less the same thing, say disk performance. It is unclear what to do when one subset of tests shows degradation and another subset shows neutral status or improvements. Is there a regression? I suppose that the availability of historical data can help in such situations, as it can show the behaviour of particular tests and allow weights to be assigned in decision-making algorithms, but that is just my guess.

# System Profiles

Tim's idea of reducing results to "pass / fail", and my experience with various people trying to interpret benchmarking results, led me to the idea of "profiles": a set of parameters and metrics collected from a reference system while executing a particular configuration of a particular benchmark. Profiles can be used for A/B comparison with pass/fail (or match/no match) outcomes; a minimal comparison sketch is shown after the list below. This approach does not hide or lose the details and allows us to capture multiple characteristics of the experiment, such as the presence of outliers/errors or a skewed distribution. Interested parties (kernel developers or performance engineers, for example) can dig deeper to find the reason for a mismatch, while for those who only care about high-level results, pass/fail should be enough. Here is how I imagine the structure of a profile:

---
profile_a:
  system:
    packages:
      - pkg_1
      - pkg_2
      # Additional packages...
    settings:
      - cmdline
      # Additional settings...
    indicators:
      cpu: null
      ram: null
      loadavg: null
      # Additional indicators...
  benchmark:
    settings:
      param_1: null
      param_2: null
      param_x: null
    metrics:
      metric_1: null
      metric_2: null
      metric_x: null
---

- System Packages, System Settings: We usually do not pay much attention to this, but I think it is worth highlighting that the base OS is an important factor, as distribution-specific modifications are present in the filesystem. Most commonly developers and researchers use Ubuntu (as the most popular distro) or Debian (as the cleaner, more lightweight base that Ubuntu derives from), but distributions apply their own patches to the kernel and system libraries, which may impact performance. Another kind of base OS is cloud OS images, which can be modified by cloud providers to add internal packages and services that could also affect performance. We must take this aspect into account so that we compare apples to apples.
- System Indicators: Periodic statistics such as CPU utilisation, RAM consumption and other parameters, collected before, during and after benchmarking.
- Benchmark Settings: Benchmarking systems have multiple parameters, so it is important to capture them and use them in the analysis.
- Benchmark Metrics: The benchmark results themselves. It is not uncommon for a benchmark to produce more than a single number.
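To illustrate the profile-based A/B comparison mentioned above, here is a minimal sketch in Python. The tolerance value, field names and the flat metric layout are assumptions for illustration only, not a proposed implementation:

---
# Minimal sketch of an A/B profile comparison; thresholds and field
# names are illustrative assumptions.
def compare_profiles(reference: dict, candidate: dict,
                     tolerance: float = 0.05) -> dict:
    """Compare candidate metrics against a reference profile.

    Returns an overall pass/fail verdict plus per-metric details, so the
    high-level result does not hide the underlying data.
    """
    details = {}
    for name, ref_value in reference["benchmark"]["metrics"].items():
        cand_value = candidate["benchmark"]["metrics"].get(name)
        if cand_value is None or ref_value in (None, 0):
            details[name] = {"status": "missing-data"}
            continue
        deviation = (cand_value - ref_value) / ref_value
        details[name] = {
            "reference": ref_value,
            "candidate": cand_value,
            "deviation": round(deviation, 4),
            "status": "match" if abs(deviation) <= tolerance else "mismatch",
        }
    verdict = "pass" if all(
        d.get("status") == "match" for d in details.values()
    ) else "fail"
    return {"verdict": verdict, "details": details}
---

A CI pipeline would only need the "verdict" field, while a kernel developer or performance engineer could inspect "details" to see which metrics diverged and by how much.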
# Analysis

The proposed rules-based analysis will only work for highly deterministic, well-defined environments and systems, where rules can describe all the relevant aspects. Rule-based systems are easier to understand and implement than other approaches, but only while the rule set stays small. We are dealing with a live, constantly evolving system, so rules will become outdated extremely fast. It is the same story as with rule-based recommender systems in the early years of machine learning. If you do want to follow a rules-based approach, it is probably worth taking a look at https://www.clipsrules.net, as this would decouple the results from the analysis and avoid reinventing an analysis engine. Declaring those rules will still be error-prone because of their origin: they must be written and maintained by humans. IMHO, a human-less approach using modern ML methods would be more beneficial in the long run.

# Methodology

The methodology used is another aspect not directly related to Tim's slides, but it is an important topic for results processing and interpretation. An automated results-interpretation system would probably push users towards one methodology or another.

# Next steps

I would be glad to participate in further discussions and to share experience to improve kernel performance testing automation, analysis and interpretation of results. If there is interest, I am open to collaborating on implementing some of these ideas.

--
Best regards,
Konstantin Belov