[PATCH v2] kselftests: Add test to detect boot event slowdowns

Laura Nao <laura.nao@xxxxxxxxxxxxx> · Fri, 18 Oct 2024 12:14:39 +0200

Introduce a new kselftest to identify slowdowns in key boot events.
This test uses ftrace to monitor the start and end times, as well as
the durations of all initcalls, and compares these timings to reference
values to identify significant slowdowns.
The script functions in two modes: the 'generate' mode allows to create
a JSON file containing initial reference timings for all initcalls from
a known stable kernel. The 'test' mode can be used during subsequent
boots to assess current timings against the reference values and
determine if there are any significant differences.
The test ships with a bootconfig file for setting up ftrace and a
configuration fragment for the necessary kernel configs.

Signed-off-by: Laura Nao <laura.nao@xxxxxxxxxxxxx>
---
Hello,

This v2 is a follow-up to RFCv1[1] and includes changes based on feedback
from the LPC 2024 session [2], along with some other fixes.

[1] https://lore.kernel.org/all/20240725110622.96301-1-laura.nao@xxxxxxxxxxxxx/
[2] https://www.youtube.com/watch?v=rWhW2-Vzi40

After reviewing other available tests and considering the feedback from
discussions at Plumbers, I decided to stick with the bootconfig file
approach but extend it to track all initcalls instead of a fixed set of
functions or events. The bootconfig file can be expanded and adapted to
track additional functions if needed for specific use cases.

I also defined a synthetic event to calculate initcall durations, while
still tracking their start and end times. Users are then allowed to choose
whether to compare start times, end times, or durations. Support for
specifying different rules for different initcalls has also been added.

In RFCv1, there was some discussion about using existing tools like
bootgraph.py. However, the output from these tools is mainly for manual
inspection (e.g., HTML visual output), whereas this test is designed to run
in automated CI environments too. The kselftest proposed here combines the
process of generating reference data and running tests into a single script
with two modes, making it easy to integrate into automated workflows.

Many of the features in this v2 (e.g., generating a JSON reference file,
comparing timings, and reporting results in KTAP format) could potentially
be integrated into bootgraph.py with some effort.
However, since this test is intended for automated execution rather than
manual use, I've decided to keep it separate for now and explore the
options suggested at LPC, such as using ftrace histograms for initcall
latencies. I'm open to revisiting this decision and working toward
integrating the changes into bootgraph.py if there's a strong preference
for unifying the tools.

Let me know your thoughts.

A comprehensive changelog is reported below.

Thanks,

Laura
---
Changes in v2:
- Updated ftrace configuration to track all initcall start times, end
  times, and durations, and generate a histogram.
- Modified test logic to compare initcall durations by default, with the
  option to compare start or end times if needed.
- Added warnings if the initcalls in the reference file differ from those
  detected in the running system.
- Combined the scripts into a single script with two modes: one for
  generating the reference file and one for running the test.
- Added support for specifying different rules for individual initcalls.
- Switched the reference format from YAML to JSON.
- Added metadata to the reference file, including kernel version, kernel
  configuration, and cmdline.
- Link to v1: https://lore.kernel.org/all/20240725110622.96301-1-laura.nao@xxxxxxxxxxxxx/
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/boot-time/Makefile    |  16 ++
 tools/testing/selftests/boot-time/bootconfig  |  15 +
 tools/testing/selftests/boot-time/config      |   6 +
 .../selftests/boot-time/test_boot_time.py     | 265 ++++++++++++++++++
 5 files changed, 303 insertions(+)
 create mode 100644 tools/testing/selftests/boot-time/Makefile
 create mode 100644 tools/testing/selftests/boot-time/bootconfig
 create mode 100644 tools/testing/selftests/boot-time/config
 create mode 100755 tools/testing/selftests/boot-time/test_boot_time.py

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index b38199965f99..1bb20d1e3854 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -3,6 +3,7 @@ TARGETS += acct
 TARGETS += alsa
 TARGETS += amd-pstate
 TARGETS += arm64
+TARGETS += boot-time
 TARGETS += bpf
 TARGETS += breakpoints
 TARGETS += cachestat
diff --git a/tools/testing/selftests/boot-time/Makefile b/tools/testing/selftests/boot-time/Makefile
new file mode 100644
index 000000000000..cdcdc1bbe779
--- /dev/null
+++ b/tools/testing/selftests/boot-time/Makefile
@@ -0,0 +1,16 @@
+PY3 = $(shell which python3 2>/dev/null)
+
+ifneq ($(PY3),)
+
+TEST_PROGS := test_boot_time.py
+
+include ../lib.mk
+
+else
+
+all: no_py3_warning
+
+no_py3_warning:
+	@echo "Missing python3. This test will be skipped."
+
+endif
\ No newline at end of file
diff --git a/tools/testing/selftests/boot-time/bootconfig b/tools/testing/selftests/boot-time/bootconfig
new file mode 100644
index 000000000000..e4b89a33b7a3
--- /dev/null
+++ b/tools/testing/selftests/boot-time/bootconfig
@@ -0,0 +1,15 @@
+ftrace.event {
+    synthetic.initcall_latency {
+        # Synthetic event to record initcall latency, start, and end times
+        fields = "unsigned long func", "u64 lat", "u64 start", "u64 end"
+        actions = "hist:keys=func.sym,start,end:vals=lat:sort=lat"
+    }
+    initcall.initcall_start {
+        # Capture the start time (ts0) when initcall starts
+        actions = "hist:keys=func:ts0=common_timestamp.usecs"
+    }
+    initcall.initcall_finish {
+        # Capture the end time, calculate latency, and trigger synthetic event
+        actions = "hist:keys=func:lat=common_timestamp.usecs-$ts0:start=$ts0:end=common_timestamp.usecs:onmatch(initcall.initcall_start).initcall_latency(func,$lat,$start,$end)"
+    }
+}
\ No newline at end of file
diff --git a/tools/testing/selftests/boot-time/config b/tools/testing/selftests/boot-time/config
new file mode 100644
index 000000000000..bcb646ec3cd8
--- /dev/null
+++ b/tools/testing/selftests/boot-time/config
@@ -0,0 +1,6 @@
+CONFIG_TRACING=y
+CONFIG_BOOTTIME_TRACING=y
+CONFIG_BOOT_CONFIG_EMBED=y
+CONFIG_BOOT_CONFIG_EMBED_FILE="tools/testing/selftests/boot-time/bootconfig"
+CONFIG_SYNTH_EVENTS=y
+CONFIG_HIST_TRIGGERS=y
\ No newline at end of file
diff --git a/tools/testing/selftests/boot-time/test_boot_time.py b/tools/testing/selftests/boot-time/test_boot_time.py
new file mode 100755
index 000000000000..556dacf04b6d
--- /dev/null
+++ b/tools/testing/selftests/boot-time/test_boot_time.py
@@ -0,0 +1,265 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (c) 2024 Collabora Ltd
+#
+# This script reads the
+# /sys/kernel/debug/tracing/events/synthetic/initcall_latency/hist file,
+# extracts function names and timings, and compares them against reference
+# timings provided in an input JSON file to identify significant boot
+# slowdowns.
+# The script operates in two modes:
+# - Generate Mode: parses initcall timings from the current kernel's ftrace
+#   event histogram and generates a JSON reference file with function
+#   names, start times, end times, and latencies.
+# - Test Mode: compares current initcall timings against the reference
+#   file, allowing users to define a maximum allowed difference between the
+#   values (delta). Users can also apply custom delta thresholds for
+#   specific initcalls using regex-based overrides. The comparison can be
+#   done on latency, start, or end times.
+#
+
+import os
+import sys
+import argparse
+import gzip
+import json
+import re
+import subprocess
+
+this_dir = os.path.dirname(os.path.realpath(__file__))
+sys.path.append(os.path.join(this_dir, "../kselftest/"))
+
+import ksft
+
+def load_reference_from_json(file_path):
+    """
+    Load reference data from a JSON file and returns the parsed data.
+    @file_path: path to the JSON file.
+    """
+
+    try:
+        with open(file_path, 'r', encoding="utf-8") as file:
+            return json.load(file)
+    except FileNotFoundError:
+        ksft.print_msg(f"Error: File {file_path} not found.")
+        ksft.exit_fail()
+    except json.JSONDecodeError:
+        ksft.print_msg(f"Error: Failed to decode JSON from {file_path}.")
+        ksft.exit_fail()
+
+
+def mount_debugfs(path):
+    """
+    Mount debugfs at the specified path if it is not already mounted.
+    @path: path where debugfs should be mounted
+    """
+    # Check if debugfs is already mounted
+    with open('/proc/mounts', 'r', encoding="utf-8") as mounts:
+        for line in mounts:
+            if 'debugfs' in line and path in line:
+                print(f"debugfs is already mounted at {path}")
+                return True
+
+    # Mount debugfs
+    try:
+        subprocess.run(['mount', '-t', 'debugfs', 'none', path], check=True)
+        return True
+    except subprocess.CalledProcessError as e:
+        print(f"Failed to mount debugfs: {e.stderr}")
+        return False
+
+
+def ensure_unique_function_name(func, initcall_entries):
+    """
+    Ensure the function name is unique by appending a suffix if necessary.
+    @func: the original function name.
+    @initcall_entries: a dictionary containing parsed initcall entries.
+    """
+    i = 2
+    base_func = func
+    while func in initcall_entries:
+        func = f'{base_func}[{i}]'
+        i += 1
+    return func
+
+
+def parse_initcall_latency_hist():
+    """
+    Parse the ftrace histogram for the initcall_latency event, extracting
+    function names, start times, end times, and latencies. Return a
+    dictionary where each entry is structured as follows:
+    {
+        <function symbolic name>: {
+            "start": <start time>,
+            "end": <end time>,
+            "latency": <latency>
+        }
+    }
+    """
+
+    pattern = re.compile(r'\{ func: \[\w+\] ([\w_]+)\s*, start: *(\d+), end: *(\d+) \} hitcount: *\d+  lat: *(\d+)')
+    initcall_entries = {}
+
+    try:
+        with open('/sys/kernel/debug/tracing/events/synthetic/initcall_latency/hist', 'r', encoding="utf-8") as hist_file:
+            for line in hist_file:
+                match = pattern.search(line)
+                if match:
+                    func = match.group(1).strip()
+                    start = int(match.group(2))
+                    end = int(match.group(3))
+                    latency = int(match.group(4))
+
+                    # filter out unresolved names
+                    if not func.startswith("0x"):
+                        func = ensure_unique_function_name(func, initcall_entries)
+
+                        initcall_entries[func] = {
+                            "start": start,
+                            "end": end,
+                            "latency": latency
+                        }
+    except FileNotFoundError:
+        print("Error: Histogram file not found.")
+
+    return initcall_entries
+
+
+def compare_initcall_list(ref_initcall_entries, cur_initcall_entries):
+    """
+    Compare the current list of initcall functions against the reference
+    file. Print warnings if there are unique entries in either.
+    @ref_initcall_entries: reference initcall entries.
+    @cur_initcall_entries: current initcall entries.
+    """
+    ref_entries = set(ref_initcall_entries.keys())
+    cur_entries = set(cur_initcall_entries.keys())
+
+    unique_to_ref = ref_entries - cur_entries
+    unique_to_cur = cur_entries - ref_entries
+
+    if (unique_to_ref):
+        ksft.print_msg(
+            f"Warning: {list(unique_to_ref)} not found in current data. Consider updating reference file.")
+    if unique_to_cur:
+        ksft.print_msg(
+            f"Warning: {list(unique_to_cur)} not found in reference data. Consider updating reference file.")
+
+
+def run_test(ref_file_path, delta, overrides, mode):
+    """
+    Run the test comparing the current timings with the reference values.
+    @ref_file_path: path to the JSON file containing reference values.
+    @delta: default allowed difference between reference and current
+    values.
+    @overrides: override rules in the form of regex:threshold.
+    @mode: the comparison mode (either 'start', 'end', or 'latency').
+    """
+
+    ref_data = load_reference_from_json(ref_file_path)
+
+    ref_initcall_entries = ref_data['data']
+    cur_initcall_entries = parse_initcall_latency_hist()
+
+    compare_initcall_list(ref_initcall_entries, cur_initcall_entries)
+
+    ksft.set_plan(len(ref_initcall_entries))
+
+    for func_name in ref_initcall_entries:
+        effective_delta = delta
+        for regex, override_delta in overrides.items():
+            if re.match(regex, func_name):
+                effective_delta = override_delta
+                break
+        if (func_name in cur_initcall_entries):
+            ref_metric = ref_initcall_entries[func_name].get(mode)
+            cur_metric = cur_initcall_entries[func_name].get(mode)
+            if (cur_metric > ref_metric and (cur_metric - ref_metric) >= effective_delta):
+                ksft.test_result_fail(func_name)
+                ksft.print_msg(f"'{func_name}' {mode} differs by "
+                               f"{(cur_metric - ref_metric)} usecs.")
+            else:
+                ksft.test_result_pass(func_name)
+        else:
+            ksft.test_result_skip(func_name)
+
+
+def generate_reference_file(file_path):
+    """
+    Generate a reference file in JSON format, containing kernel metadata
+    and initcall timing data.
+    @file_path: output file path.
+    """
+    metadata = {}
+
+    config_file = "/proc/config.gz"
+    if os.path.isfile(config_file):
+        with gzip.open(config_file, "rt", encoding="utf-8") as f:
+            config = f.read()
+            metadata["config"] = config
+
+    metadata["version"] = os.uname().release
+
+    cmdline_file = "/proc/cmdline"
+    if os.path.isfile(cmdline_file):
+        with open(cmdline_file, "r", encoding="utf-8") as f:
+            cmdline = f.read().strip()
+            metadata["cmdline"] = cmdline
+
+    ref_data = {
+        "metadata": metadata,
+        "data": parse_initcall_latency_hist(),
+    }
+
+    with open(file_path, "w", encoding='utf-8') as f:
+        json.dump(ref_data, f, indent=4)
+        print(f"Generated {file_path}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="")
+
+    subparsers = parser.add_subparsers(dest='mode', required=True, help='Choose between generate or test modes')
+
+    generate_parser = subparsers.add_parser('generate', help="Generate a reference file")
+    generate_parser.add_argument('out_ref_file', nargs='?', default='reference_initcall_timings.json',
+                                 help='Path to output JSON reference file (default: reference_initcall_timings.json)')
+
+    compare_parser = subparsers.add_parser('test', help='Test against a reference file')
+    compare_parser.add_argument('in_ref_file', help='Path to JSON reference file')
+    compare_parser.add_argument(
+        'delta', type=int, help='Maximum allowed delta between the current and the reference timings (usecs)')
+    compare_parser.add_argument('--override', '-o', action='append', type=str,
+                                help="Specify regex-based rules as regex:delta (e.g., '^acpi_.*:50')")
+    compare_parser.add_argument('--mode', '-m', default='latency', choices=[
+                                'start', 'end', 'latency'],
+                                help="Comparison mode: 'latency' (default) for latency, 'start' for start times, or 'end' for end times.")
+
+    args = parser.parse_args()
+
+    if args.mode == 'generate':
+        generate_reference_file(args.out_ref_file)
+        sys.exit(0)
+
+    # Process overrides
+    overrides = {}
+    if args.override:
+        for override in args.override:
+            try:
+                pattern, delta = override.split(":")
+                overrides[pattern] = int(delta)
+            except ValueError:
+                print(f"Invalid override format: {override}. Expected format is 'regex:delta'.")
+                sys.exit(1)
+
+    # Ensure debugfs is mounted
+    if not mount_debugfs("/sys/kernel/debug"):
+        ksft.exit_fail()
+
+    ksft.print_header()
+
+    run_test(args.in_ref_file, args.delta, overrides, args.mode)
+
+    ksft.finished()
-- 
2.30.2