On Fri, Sep 27, 2024 at 06:13:40PM +0800, Yipeng Zou wrote:
> Hi everyone,
>
> I am currently working on a patch for a CPU frequency governor based on
> BPF, which can use BPF to customize and implement various frequency
> scaling strategies.
>
> If you have any feedback or suggestions, please do let me know.
>
> Motivation
> ----------
>
> 1. Customization
>
> Existing cpufreq governors in the kernel are designed for general
> scenarios, which may not always be optimal for specific or specialized
> workloads.
>
> The userspace governor allows direct control over cpufreq, but users
> often require guidance from the kernel to achieve the desired frequency.
>
> Cpufreq_ext aims to address this by providing a customizable framework that
> can be tailored to the unique needs of different systems and applications.
>
> While cpufreq governors can be implemented within a kernel module,
> maintaining a ko tailored for specific scenarios can be challenging.
> The complexity and overhead associated with kernel modules make it
> difficult to quickly adapt and deploy custom frequency scaling strategies.
>
> Cpufreq_ext leverages BPF to offer a more lightweight and flexible approach
> to implementing customized strategies, allowing for easier maintenance and
> deployment.
>
> 2. Integration with sched_ext:
>
> sched_ext is a scheduler class whose behavior can be defined by a set of
> BPF programs - the BPF scheduler.
>
> Look for more about sched_ext in [1]:
>
> [1] https://www.kernel.org/doc/html/next/scheduler/sched-ext.html
>
> The interaction between CPU frequency scaling and task scheduling is
> critical for performance.
>
> cpufreq_ext can work with sched_ext to ensure that both scheduling
> decisions and frequency adjustments are made in a coordinated manner,
> optimizing system responsiveness and power consumption.

Hi Yipeng,

I prototyped something really similar earlier this year, and the
conclusion I came to was that a governor might not be the right
abstraction for struct_ops. One issue is that, depending on the
frequency driver being used, the driver may include its own governor
implementation (e.g. intel_pstate). For sched_ext there is already a
kfunc (scx_bpf_cpuperf_set) which calls into cpufreq_update_util, and
that has been working well so far.

> Overview
> --------
>
> The cpufreq ext is a BPF based cpufreq governor, we can customize
> cpufreq governor in BPF program.
>
> CPUFreq ext works as common cpufreq governor with cpufreq policy.
>
>         --------------------------
>         |      BPF governor      |
>         --------------------------
>                     |
>                     v
>                BPF Register
>                     |
>                     v
>    --------------------------------------
>    |            CPUFreq ext             |
>    --------------------------------------
>       ^              ^               ^
>       |              |               |
>   ---------      ---------       ---------
>  | policy0 | ... | policy1 | ... | policyn |
>   ---------      ---------       ---------
>
> We can register serval function hooks to cpufreq ext by BPF Struct OPS.
>
> The first patch define a dbs_governor, and it's works like other
> governor.
>
> The second patch gives a sample how to use it, implement one
> typical cpufreq governor, switch to max cpufreq when VIP task
> is running on target cpu.
>
> Detail
> ------
>
> The cpufreq ext use bpf_struct_ops to register serval function hooks.
>
> struct cpufreq_governor_ext_ops {
>         ...
> }
>
> Cpufreq_governor_ext_ops defines all the functions that BPF programs can
> implement customly.
>
> If you need to add a custom function, you only need to define it in this
> struct.
>
> At the moment we have defined the basic functions.
>
> 1. unsigned long (*get_next_freq)(struct cpufreq_policy *policy)
>
>    Make decision how to adjust cpufreq here.
>    The return value represents the CPU frequency that will be
>    updated.
>
> 2. unsigned int (*get_sampling_rate)(struct cpufreq_policy *policy)
>
>    Make decision how to adjust sampling_rate here.
>    The return value represents the governor samplint rate that
>    will be updated.
>

Why does the governor need a sampling rate? Could this be done with a
bpf timer instead?

> 3. unsigned int (*init)(void)
>
>    BPF governor init callback, return 0 means success.
>
> 4. void (*exit)(void)
>
>    BPF governor exit callback.
>
> 5. char name[CPUFREQ_EXT_NAME_LEN]
>
>    BPF governor name.
>

I'm guessing it would be useful to have the governor dispatch on almost
all the governor methods. IIRC I had something like:

	int (*start)(struct cpufreq_policy *policy);
	void (*stop)(struct cpufreq_policy *policy);
	void (*limits)(struct cpufreq_policy *policy);
	int (*store_setspeed)(struct cpufreq_policy *policy, unsigned int freq);

> The cpufreq_ext also add sysfs interface which refer to governor status.
>
> 1. ext/stat attribute:
>
>    Access to current BPF governor status.
>
>    # cat /sys/devices/system/cpu/cpufreq/ext/stat
>    Stat: CPUFREQ_EXT_INIT
>    BPF governor: performance
>
> There are number of constraints on the cpufreq_ext:
>
> 1. Only one ext governor can be registered at a time.
>
> 2. By default, it operates as a performance governor when no BPF
>    governor is registered.
>
> 3. The cpufreq_ext governor must be selected before loading a BPF
>    governor; otherwise, the installation of the BPF governor will fail.
>
> TODO
> ----
>
> The current patch is a starting point, and future work will focus on
> expanding its capabilities.
>
> I plan to leverage the BPF ecosystem to introduce innovative features,
> such as real-time adjustments and optimizations based on system-wide
> observations and analytics.
>
> And I am looking forward to any insights, critiques, or suggestions you
> may have.
>
> Yipeng Zou (2):
>   cpufreq_ext: Introduce cpufreq ext governor
>   cpufreq_ext: Add bpf sample
>
>  drivers/cpufreq/Kconfig        |  23 ++
>  drivers/cpufreq/Makefile       |   1 +
>  drivers/cpufreq/cpufreq_ext.c  | 525 +++++++++++++++++++++++++++++++++
>  samples/bpf/.gitignore         |   1 +
>  samples/bpf/Makefile           |   8 +-
>  samples/bpf/cpufreq_ext.bpf.c  | 113 +++++++
>  samples/bpf/cpufreq_ext_user.c |  48 +++
>  7 files changed, 718 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/cpufreq/cpufreq_ext.c
>  create mode 100644 samples/bpf/cpufreq_ext.bpf.c
>  create mode 100644 samples/bpf/cpufreq_ext_user.c
>
> --
> 2.34.1
>
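
For illustration only, the dispatch idea discussed above (a governor
that forwards its methods to an optional ops table, and behaves like
the performance governor when nothing is registered) could be modelled
in plain userspace C roughly as below. This is a sketch and not the
code from the patch or from any prototype: struct policy_model,
struct ext_ops_model, clamp_freq(), ext_next_freq(),
ext_store_setspeed() and the two example hooks are all hypothetical
names, and struct policy_model only stands in for the kernel's
struct cpufreq_policy.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cpufreq_policy (frequencies in kHz). */
struct policy_model {
	unsigned int min;
	unsigned int max;
	unsigned int cur;
};

/*
 * Hypothetical ops table. The hook names and signatures follow the ones
 * mentioned in this thread, but take the model policy instead of the
 * real kernel structure.
 */
struct ext_ops_model {
	unsigned long (*get_next_freq)(struct policy_model *policy);
	int (*start)(struct policy_model *policy);
	void (*stop)(struct policy_model *policy);
	void (*limits)(struct policy_model *policy);
	int (*store_setspeed)(struct policy_model *policy, unsigned int freq);
};

/* Keep a requested frequency inside the policy's [min, max] range. */
static unsigned int clamp_freq(const struct policy_model *policy,
			       unsigned long freq)
{
	if (freq < policy->min)
		return policy->min;
	if (freq > policy->max)
		return policy->max;
	return (unsigned int)freq;
}

/*
 * Pick the next frequency. With no ops table or no get_next_freq hook,
 * fall back to the maximum frequency, i.e. performance-like behaviour.
 */
static unsigned int ext_next_freq(const struct ext_ops_model *ops,
				  struct policy_model *policy)
{
	unsigned long freq = policy->max;

	if (ops && ops->get_next_freq)
		freq = ops->get_next_freq(policy);

	policy->cur = clamp_freq(policy, freq);
	return policy->cur;
}

/* Forward a setspeed request; reject it if the hook is not implemented. */
static int ext_store_setspeed(const struct ext_ops_model *ops,
			      struct policy_model *policy, unsigned int freq)
{
	if (!ops || !ops->store_setspeed)
		return -EINVAL;

	return ops->store_setspeed(policy, clamp_freq(policy, freq));
}

/* Example hook: always ask for half of the maximum frequency. */
static unsigned long half_of_max(struct policy_model *policy)
{
	return policy->max / 2;
}

/* Example hook: accept whatever (already clamped) frequency is passed in. */
static int set_cur(struct policy_model *policy, unsigned int freq)
{
	policy->cur = freq;
	return 0;
}
```

In this model every hook is optional, so a partially filled ops table
still leaves the policy in a defined state. Whether a real struct_ops
governor should treat missing hooks this way is a design question the
sketch does not settle.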