Resending RFC. This patchset is not final. I am looking for feedback on
this proposal to share thread-specific data for use in latency-sensitive
code paths.

(patchset based on v5.14-rc7)

Cover letter previously sent:
----------------------------

Some applications, such as databases, need to read thread-specific stats
from the kernel frequently in latency-sensitive code paths. The overhead
of reading these stats through system calls affects performance. One use
case is reading a thread's scheduler stats from its /proc schedstat file
(/proc/pid/schedstat) to collect the time the thread spent executing on
the cpu (sum_exec_runtime) and the time it spent blocked waiting on the
runq (run_delay). These scheduler stats, read several times per
transaction in a latency-sensitive code path, are used to measure the
time taken by DB operations.

This patch proposes a mechanism for the kernel to share thread stats
through a per-thread structure shared between user space and the kernel.
The per-thread shared structure is allocated on a page that is mapped
shared between user space and the kernel, which provides a fast
communication path between the two. The kernel publishes stats in this
shared structure, and the application thread reads them in user space
without making system calls.

Similarly, there can be other use cases for such a shared structure
mechanism.

Introduce 'off cpu' time:

The time spent by a thread executing on a cpu (sum_exec_runtime),
currently available through the thread's schedstat file, can be shared
through the shared structure mentioned above. However, while a thread is
running on the cpu, this time is only updated periodically as part of
scheduler tick processing, which can take 1ms or more. If the
application has to measure the cpu time consumed across some DB
operations, using 'sum_exec_runtime' will not be accurate.
To address this, the proposal is to introduce a thread's 'off cpu' time,
which is measured at context switch, in the same way as the time on the
runq (i.e. run_delay in the schedstat file), and should therefore be
more accurate. The application can then determine the cpu time consumed
by taking the elapsed time and subtracting the off cpu time. The off cpu
time will be made available through the shared structure along with the
other schedstats from the /proc/pid/schedstat file.

The elapsed time itself can be measured using clock_gettime, which is
vdso optimized and would be fast. The schedstats (runq time & off cpu
time) published in the shared structure are accumulated times, the same
as what is available through the schedstat file, all in units of
nanoseconds. The application takes the difference between the values
read before and after the operation to get the measurement.

Preliminary results from a simple cached-read database workload show a
performance benefit when the database uses the shared struct for reading
stats instead of reading from /proc directly.

Implementation:

A new system call is added for a user thread to request use of the
shared structure. The kernel will allocate page(s), mapped shared with
user space, in which the per-thread shared structures will be allocated.
These structures are padded to 128 bytes. Each structure will contain
struct members or nested structures corresponding to the supported
stats, like the thread's schedstats, published by the kernel for user
space consumption. More struct members can be added as support for new
features is implemented. Multiple such shared structures will be
allocated from a page (up to 32 per 4k page), which avoids having to
allocate one page per thread of a process, although this will need
optimizing for locality. Additional pages will be allocated as needed to
accommodate more threads requesting use of shared structures. The aim is
to not expose the layout of the shared structure itself to the
application, which will allow future enhancements/changes without
affecting the existing APIs.
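
For illustration only, the slot arithmetic described above (128-byte
padded structures, up to 32 per 4k page) can be sketched in user-space
C as follows. This is not code from the patchset: the names demo_stats,
demo_slot, slot_offset and pages_needed are hypothetical, and the
payload fields are only a stand-in for whatever the kernel actually
places in the shared structure.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* the "4k page" from the text */
#define SLOT_SIZE         128u   /* padded structure size from the text */

/* Hypothetical stand-in for the stats payload, not the patch's layout. */
struct demo_stats {
	uint64_t sum_exec_runtime;
	uint64_t run_delay;
	uint64_t pcount;
	uint64_t off_cpu;
};

/* One per-thread slot: the payload padded out to SLOT_SIZE bytes. */
union demo_slot {
	struct demo_stats stats;
	unsigned char pad[SLOT_SIZE];
};

/* Compile-time check that the padding gives exactly 128 bytes. */
_Static_assert(sizeof(union demo_slot) == SLOT_SIZE,
	       "slot must be exactly 128 bytes");

/* How many padded slots fit in one page. */
enum { SLOTS_PER_PAGE = PAGE_SIZE_ASSUMED / SLOT_SIZE };

/* Byte offset of slot 'idx' from the start of its page. */
static size_t slot_offset(unsigned int idx)
{
	return (size_t)idx * SLOT_SIZE;
}

/* Pages needed to hold one slot per thread, rounding up. */
static unsigned int pages_needed(unsigned int nthreads)
{
	return (nthreads + SLOTS_PER_PAGE - 1) / SLOTS_PER_PAGE;
}
```

With these assumptions, a process with 33 threads requesting shared
structures would need two pages rather than 33, which is the saving the
paragraph above refers to.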
The system call returns a pointer (a user space mapped address) to the
per-thread shared structure members. The application would save this
per-thread pointer in a TLS variable and reference it.

The system call is of the form:

  int task_getshared(int option, int flags, void __user *uaddr)

  // Currently only TASK_SCHEDSTAT option is supported - returns pointer
  // to struct task_schedstat. The struct task_schedstat is nested within
  // the shared structure.

  struct task_schedstat {
          volatile u64    sum_exec_runtime;
          volatile u64    run_delay;
          volatile u64    pcount;
          volatile u64    off_cpu;
  };

Usage:

  __thread struct task_schedstat *ts;
  task_getshared(TASK_SCHEDSTAT, 0, &ts);

Subsequently the stats are accessed by the thread using the 'ts' pointer.

Prakash Sangappa (3):
  Introduce per thread user-kernel shared structure
  Publish tasks's scheduler stats thru the shared structure
  Introduce task's 'off cpu' time

 arch/x86/entry/syscalls/syscall_32.tbl |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/mm_types.h               |   2 +
 include/linux/sched.h                  |   9 +
 include/linux/syscalls.h               |   2 +
 include/linux/task_shared.h            |  92 ++++++++++
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/task_shared.h       |  23 +++
 kernel/fork.c                          |   7 +
 kernel/sched/deadline.c                |   1 +
 kernel/sched/fair.c                    |   1 +
 kernel/sched/rt.c                      |   1 +
 kernel/sched/sched.h                   |   1 +
 kernel/sched/stats.h                   |  55 ++++--
 kernel/sched/stop_task.c               |   1 +
 kernel/sys_ni.c                        |   3 +
 mm/Makefile                            |   2 +-
 mm/task_shared.c                       | 314 +++++++++++++++++++++++++++++++++
 18 files changed, 501 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/task_shared.h
 create mode 100644 include/uapi/linux/task_shared.h
 create mode 100644 mm/task_shared.c

-- 
2.7.4