On Wed, Dec 02, 2020 at 04:57:31PM +0100, Thomas Gleixner wrote: > On Mon, Nov 30 2020 at 09:31, Christoph Lameter wrote: > > On Fri, 27 Nov 2020, Marcelo Tosatti wrote: > > > >> Decided to switch to prctl interface, and then it starts > >> to become similar to "task mode isolation" patchset API. > > > > Right I think that was a good approach. > > prctl() is the right thing to do. > > >> In addition to quiescing pending activities on the CPU, it would > >> also be useful to assign a per-task attribute (which is then assigned > >> to a per-CPU attribute), indicating whether that CPU is running > >> an isolated task or not. > > > > Sounds good but what would this do? Give a warning like the isolation > > patchset? > > This all needs a lot more thought about the overall picture. We already > have too many knobs and ad hoc hooks which fiddle with isolation. Well this would mainly add "task->is_isolated" and therefore (when given task is scheduled on CPU), "cpu->is_running_isolated_task". One can then use that information from the individual places in the kernel to make decisions. > The current CPU isolation is a best effort approach and I agree that for > more strict isolation modes we need to be able to enforce that and hunt > down offenders and think about them one by one. What we see in practice, for vRAN scenarios is the following: There are per-application and even "per-installation" latency requirements (lets say, considering a packet travelling a number of forwarding hosts). Depending on: * The end to end desired packet latency. * The number of links, and latency of each link. One can give each forwarding node an "allowed interruption sum per fixed period". Or, for 4G/5G vRAN application requirement (due to HARQ response times, see http://dl.ifip.org/db/conf/icin/icin2016/1570230300.pdf), so far that is 100us per 1ms, or 50us per 1ms (on the safe side of the 100us) going down to 20us per 1ms. If the kernel can keep: 1) How long each interruption took, that happened in a given period. 2) If the sum of those interruptions crosses a given threshold. It would be useful to get traces of those interruptions (tracing this manually, on the field, _after_ reports of excessive latency on the application is quite difficult to do). A scheme which parses trace entries in userspace would only trigger after processing the trace entries (and finding out about the over-threshold interruption). So it would be: 1) Asynchronous. 2) Application would not be informed. But it seems that a "0us tolerance with synchronous notifications" (task mode patchset) is also a valid use-case. > >> To be called before real time loop, one would have: > > Can we please agree in the first place, that "real time" is absolutely > the wrong term here? > > It's about running undisturbed CPU bound computations whatever nature > they are. It does not matter whether that loop does busy polling ala > DPDK, whether it runs a huge math computation on a data set or > whatever people come up with. Agree. Should say "interruption length". > > >> prctl(PR_SET_TASK_ISOLATION, ISOLATION_ENABLE) [1] > >> real time loop > >> prctl(PR_SET_TASK_ISOLATION, ISOLATION_DISABLE) > >> > >> (with the attribute also being cleared on task exit). > >> > >> The general description would be: > >> > >> "Set task isolated mode for a given task, returning an error > >> if the task is not pinned to a single CPU. > > Plus returning an error if the task has no permissions to request > this. This should not be an unprivileged prctl ever. > > >> In this mode, the kernel will avoid interruptions to isolated > >> CPUs when possible." > >> > >> Any objections against such an interface ? > > > > Maybe do both like in the isolation patchset? > > We really want to define the scopes first. And here you go: > > > Often code can tolerate a few interruptions (in some code branches > > regular syscalls may be needed) but one wants the thread to be > > as quiet as possible. > > So you say some code can tolerate a few interrupts, then comes Alex and > says 'no disturbance' at all. Some setups can, some don't. > The point is that all of this shares the mechanisms to quiesce certain > parts of the kernel so this wants to build common infrastructure and the > prctl(ISOLATION, MODE) mode argument defines the scope of isolation > which the task asks for and the infrastructure decides whether it can be > granted and if so orchestrates the operation and provides a common > infrastructure for instrumentation, violation monitoring etc. > > We really need to stop to look at particular workloads and defining > adhoc solutions tailored to their particular itch if we don't want to > end up with an uncoordinated and unmaintainable zoo of interfaces, hooks > and knobs. Agree. > Just looking at the problem at hand as an example. NOHZ already issues > quiet_vmstat(), but it does not cancel already scheduled work. Now > Marcelo wants a new mechanism which is supposed to cancel the work and > then Alex want's to prevent it from being rescheduled. If that's not > properly coordinated this goes down the drain very fast. > > So can we please come up with a central place to handle this prctl() > with a future proof argument list so the various isolation needs can be > expressed as required? > > That allows Marcelo to start tackling the vmstat side and Alex can > utilize that and build the other parts into it piece by piece. > > Thanks, > > tglx Thanks Thomas. I think need to discuss/think on the reporting via tracing infrastructure too... Maybe Arnaldo has some ideas. Alex how is this (in general) ? Should be able to enable the signalling with it, if i am not mistaken. Will post a version with the vmstat sync shortly. Index: linux-2.6-vmstat2/include/uapi/linux/prctl.h =================================================================== --- linux-2.6-vmstat2.orig/include/uapi/linux/prctl.h +++ linux-2.6-vmstat2/include/uapi/linux/prctl.h @@ -247,4 +247,9 @@ struct prctl_mm_map { #define PR_SET_IO_FLUSHER 57 #define PR_GET_IO_FLUSHER 58 +/* Task isolation control */ +#define PR_TASK_ISOLATION_FEATURES 59 +#define PR_TASK_ISOLATION_GET 60 +#define PR_TASK_ISOLATION_SET 61 + #endif /* _LINUX_PRCTL_H */ Index: linux-2.6-vmstat2/kernel/sys.c =================================================================== --- linux-2.6-vmstat2.orig/kernel/sys.c +++ linux-2.6-vmstat2/kernel/sys.c @@ -58,6 +58,7 @@ #include <linux/sched/coredump.h> #include <linux/sched/task.h> #include <linux/sched/cputime.h> +#include <linux/isolation.h> #include <linux/rcupdate.h> #include <linux/uidgid.h> #include <linux/cred.h> @@ -2530,6 +2531,49 @@ SYSCALL_DEFINE5(prctl, int, option, unsi error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER; break; + case PR_TASK_ISOLATION_FEATURES: + struct isolation_features ifeat; + + memset(&ifeat, 0, sizeof(ifeat)); + + error = prctl_task_isolation_features(ifeat); + if (copy_to_user((char __user *)arg2, &ifeat, sizeof(ifeat))) + return -EFAULT; + break; + + case PR_TASK_ISOLATION_SET: + struct isolation_control *icontrol; + + icontrol = kzalloc(sizeof(struct isolation_control), GFP_KERNEL); + if (!icontrol) + return -ENOMEM; + + if (copy_from_user(icontrol, (char __user *)arg2, + sizeof(struct isolation_control))) { + kfree(icontrol); + return -EFAULT; + } + + error = prctl_task_isolation_set(icontrol); + kfree(icontrol); + break; + case PR_TASK_ISOLATION_GET: + struct isolation_control *icontrol; + + icontrol = kzalloc(sizeof(struct isolation_control), GFP_KERNEL); + if (!icontrol) + return -ENOMEM; + + error = prctl_task_isolation_get(icontrol); + + if (copy_to_user(icontrol, (char __user *)arg2, + sizeof(struct isolation_control))) { + kfree(icontrol); + return -EFAULT; + } + + kfree(icontrol); + break; default: error = -EINVAL; break; Index: linux-2.6-vmstat2/Documentation/userspace-api/task_isolation.rst =================================================================== --- /dev/null +++ linux-2.6-vmstat2/Documentation/userspace-api/task_isolation.rst @@ -0,0 +1,59 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================ +Task isolation CPU interface +============================ + +The kernel might perform a number of activities in the background, +on a given CPU, in the form of workqueues or interrupts. + +This interface allows userspace to indicate to the kernel when +its running latency critical code. This allows the system to +take preventive measures to avoid deferred actions and +create a OS noise free environment for the application. + +Usage +===== +``PR_TASK_ISOLATION_FEATURES``: + Returns the supported features. Features are defined + at include/uapi/linux/isolation.h. + + Usage:: + + prctl(PR_TASK_ISOLATION_FEATURES, ifeat, 0, 0, 0); + + The 'ifeat' argument is a pointer to a struct isolation_features: + + struct isolation_features { + __u32 flags; + __u32 pad[3]; + }; + + Where flags contains bit sets for the features the kernel supports. + + FIXME: describe error codes + +``PR_TASK_ISOLATION_SET``: + Enables task isolation mode. The type of task isolation is specified + via + + struct isolation_control { + __u32 flags; + __u32 pad[15]; + }; + +``PR_TASK_ISOLATION_GET``: + Retrieves the currently configured task isolation mode and + parameters. + + +(OUTDATED) After initialization, and before the latency sensitive loop, application can call: + +(OUTDATED) prctl(PR_QUIESCE_CPU, 0, 0, 0, 0); + +Example +======= + +The ``samples/quiesce_cpu/`` directory contains a sample +application. + Index: linux-2.6-vmstat2/include/uapi/linux/isolation.h =================================================================== --- /dev/null +++ linux-2.6-vmstat2/include/uapi/linux/isolation.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +/* For PR_TASK_ISOLATION_FEATURES */ +struct isolation_features { + __u32 flags; + __u32 pad[3]; +}; + +/* For PR_TASK_ISOLATION_GET / PR_TASK_ISOLATION_SET */ +struct isolation_control { + __u32 flags; + __u32 pad[15]; +}; + +/* + * Quiesce pending background activity, such as + * flushing per-CPU VM statistics. + */ + +#define TASK_ISOLATION_QUIESCE_CPU 0x1 + Index: linux-2.6-vmstat2/kernel/isolation.c =================================================================== --- /dev/null +++ linux-2.6-vmstat2/kernel/isolation.c @@ -0,0 +1,30 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Implementation of task isolation. + * + * Authors: + * Chris Metcalf <cmetcalf@xxxxxxxxxxxx> + * Alex Belits <abelits@xxxxxxxxxxx> + * Yuri Norov <ynorov@xxxxxxxxxxx> + * Marcelo Tosatti <mtosatti@xxxxxxxxxx> + */ + +#include <linux/sched.h> +#include <linux/isolation.h> + +int prctl_task_isolation_features(struct isolation_features *ifeat) +{ + ifeat->flags = TASK_ISOLATION_QUIESCE_CPU; + + return 0; +} + +int prctl_task_isolation_set(struct isolation_control *control) +{ + return 0; +} + +int prctl_task_isolation_get(struct isolation_control *control) +{ + return 0; +}