[RFC] ML infrastructure in Linux kernel

Viacheslav Dubeyko <slava@xxxxxxxxxxx> · Wed, 5 Jun 2024 14:02:19 +0300

Hello,

I would like to initiate a discussion related to an unified
infrastructure for ML workloads and user-space drivers.

[PROBLEM STATEMENT]

Last several years have revealed two important trends:
(1) moving kernel-space functionality into user-space drivers
(for example, SPDK, DPDK, ublk); (2) significant number of efforts
of using ML models for various real life applications (for example,
tuning kernel parameters, storage device failure prediction, fail
slow drive detection, and so on). Both trends represent significant
importance for the evolution of the Linux kernel. From one point of view,
user-space drivers represent the way of decreasing the latency and
improving the performance of operations. However, from another point of
view, the approach of bypassing the Linux kernel introduces security and
efficiency risks, potential synchronization issues of user-space threads,
and breaking the Linux kernel architecture’s paradigm. Generally speaking,
direct implementation of ML approaches in Linux kernel-space is very hard,
inefficient, and problematic because of practical unavailability of
floating point operations in the Linux kernel, and the computational power
hungry nature of ML algorithms (especially, during training phase).
It is possible to state that Linux kernel needs to introduce and to unify
an infrastructure as for ML approaches as for user-space drivers.

[WHY DO WE NEED ML in LINUX KERNEL?]

Do we really need a ML infrastructure in the Linux kernel? First of all,
it is really easy to imagine a lot of down to earth applications of ML
algorithms for automation of routine operations during working with
Linux kernel. Moreover, potentially, the ML subsystem could be used for
automated research and statistics gathering on the whole fleet of running
Linux kernels. Also, the ML subsystem is capable of writing documentation,
tuning kernel parameters on the fly, kernel recompilation, and even automated
reporting about bugs and crashes. Generally speaking, the ML subsystem
potentially can extend the Linux kernel capabilities. The main question is how?

[POTENTIAL INFRASTRUCTURE VISION]

Technically speaking , both cases (user-space driver and ML subsystem)
require a user-space functionality that can be considered as user-space
extension of Linux kernel functionality. Such approach is similar to microkernel
architecture by the minimal functionality on kernel side and the main
functionality on user-space side with the mandatory minimization
the number of context switches between kernel-space and user-space.
The key responsibility of kernel-side agent (or subsystem) is the accounting
of user-space extensions, synchronization of their access to shared resources
or metadata on kernel side, statistics gathering and sharing it through
the sysfs or specialized log file (likewise to syslog).

For example, such specialized log file(s) can be used by ML user-space
extensions for executing the ML algorithms with the goal of analyzing data
and available statistics. Generally speaking, the main ML logic can be executed
by extension(s) on the user-space side. This ML logic can elaborate
some “recommendations”, for example, that can  be shared with an ML agent
on the kernel side. As a result, the kernel-space ML agent can check
the shared “recommendations” and to apply the valid “recommendations”
by means of Linux kernel tuning, recompilation, “hot” restart and so on.
Technically speaking, the user-space driver requires pretty much the same
architecture as the simple kernel-space agent/subsystem and user-space
extension(s). The main functionality is on the user-space side and
kernel-space side delivers only accounting the user-space extensions,
allocating necessary resources, synchronizing access to shared resources,
and gathering statistics.

Generally speaking, such an approach implies the necessity of
registering a specialized driver class that could represent
an ML agent or user-space driver on kernel side. Then, it will be possible
to use a modprobe-like model to create an instance of ML agent or
user-space driver. Finally, we will have the kernel-space agent
that is connected to the user-space extension. The key point here is that
the user-space extension can directly communicate with a hardware device,
but the kernel-space side can account for the activity of the user-space
extension and allocates resources. It is possible to suggest an unified
architecture of the kernel-side agent that will be specialized by
the logic of user-space extension. But the logic of the kernel-space
agent should be minimal, simple, and unified as much as possible.
Technically speaking, the logic of kernel-space agent can be defined by
the eBPF program and eBPF arena (or shared memory between kernel-space
and user-space) can be used for interaction between the kernel-space
agent and the user-space extension. And such interaction could be implemented
through submission and completion queues, for example.

As a summary, described architecture is capable of implementing
ML infrastructure in Linux kernel and unification of user-space drivers
architecture.

Any opinion on this? How feasible could be such vision?

Thanks,
Slava.