[Cc linux-api as this is a new user interface] On Tue 26-01-21 22:51:28, Kalesh Singh wrote: > In order to measure how much memory a process actually consumes, it is > necessary to include the DMA buffer sizes for that process in the memory > accounting. Since the handle to DMA buffers are raw FDs, it is important > to be able to identify which processes have FD references to a DMA buffer. > > Currently, DMA buffer FDs can be accounted using /proc/<pid>/fd/* and > /proc/<pid>/fdinfo -- both of which are only root readable, as follows: > 1. Do a readlink on each FD. > 2. If the target path begins with "/dmabuf", then the FD is a dmabuf FD. > 3. stat the file to get the dmabuf inode number. > 4. Read/ proc/<pid>/fdinfo/<fd>, to get the DMA buffer size. > > Android captures per-process system memory state when certain low memory > events (e.g a foreground app kill) occur, to identify potential memory > hoggers. To include a process’s dmabuf usage as part of its memory state, > the data collection needs to be fast enough to reflect the memory state at > the time of such events. > > Since reading /proc/<pid>/fd/ and /proc/<pid>/fdinfo/ requires root > privileges, this approach is not suitable for production builds. Granting > root privileges even to a system process increases the attack surface and > is highly undesirable. Additionally this is slow as it requires many > context switches for searching and getting the dma-buf info. > > With the addition of per-buffer dmabuf stats in sysfs [1], the DMA buffer > details can be queried using their unique inode numbers. > > This patch proposes adding a /proc/<pid>/task/<tid>/dmabuf_fds interface. > > /proc/<pid>/task/<tid>/dmabuf_fds contains a list of inode numbers for > every DMA buffer FD that the task has. Entries with the same inode > number can appear more than once, indicating the total FD references > for the associated DMA buffer. > > If a thread shares the same files as the group leader then its > dmabuf_fds file will be empty, as these dmabufs are reported by the > group leader. > > The interface requires PTRACE_MODE_READ_FSCRED (same as /proc/<pid>/maps) > and allows the efficient accounting of per-process DMA buffer usage without > requiring root privileges. (See data below) > > Performance Comparison: > ----------------------- > > The following data compares the time to capture the sizes of all DMA > buffers referenced by FDs for all processes on an arm64 android device. > > ------------------------------------------------------- > | Core 0 (Little) | Core 7 (Big) | > ------------------------------------------------------- > >From <pid>/fdinfo | 318 ms | 145 ms | > ------------------------------------------------------- > Inodes from | 114 ms | 27 ms | > dmabuf_fds; | (2.8x ^) | (5.4x ^) | > data from sysfs | | | > ------------------------------------------------------- > > It can be inferred that in the worst case there is a 2.8x speedup for > determining per-process DMA buffer FD sizes, when using the proposed > interfaces. > > [1] https://lore.kernel.org/dri-devel/20210119225723.388883-1-hridya@xxxxxxxxxx/ > > Signed-off-by: Kalesh Singh <kaleshsingh@xxxxxxxxxx> > --- > Documentation/filesystems/proc.rst | 30 ++++++ > drivers/dma-buf/dma-buf.c | 7 +- > fs/proc/Makefile | 1 + > fs/proc/base.c | 1 + > fs/proc/dma_bufs.c | 159 +++++++++++++++++++++++++++++ > fs/proc/internal.h | 1 + > include/linux/dma-buf.h | 5 + > 7 files changed, 198 insertions(+), 6 deletions(-) > create mode 100644 fs/proc/dma_bufs.c > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst > index 2fa69f710e2a..757dd47ab679 100644 > --- a/Documentation/filesystems/proc.rst > +++ b/Documentation/filesystems/proc.rst > @@ -47,6 +47,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@xxxxxxxxxxx> June 9 2009 > 3.10 /proc/<pid>/timerslack_ns - Task timerslack value > 3.11 /proc/<pid>/patch_state - Livepatch patch operation state > 3.12 /proc/<pid>/arch_status - Task architecture specific information > + 3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD > > 4 Configuring procfs > 4.1 Mount options > @@ -2131,6 +2132,35 @@ AVX512_elapsed_ms > the task is unlikely an AVX512 user, but depends on the workload and the > scheduling scenario, it also could be a false negative mentioned above. > > +3.13 /proc/<pid>/task/<tid>/dmabuf_fds - DMA buffers referenced by an FD > +------------------------------------------------------------------------- > +This file exposes a list of the inode numbers for every DMA buffer > +FD that the task has. > + > +The same inode number can appear more than once, indicating the total > +FD references for the associated DMA buffer. > + > +The inode number can be used to lookup the DMA buffer information in > +the sysfs interface /sys/kernel/dmabuf/buffers/<inode-no>/. > + > +Example Output > +~~~~~~~~~~~~~~ > +$ cat /proc/612/task/612/dmabuf_fds > +30972 30973 45678 49326 > + > +Permission to access this file is governed by a ptrace access mode > +PTRACE_MODE_READ_FSCREDS. > + > +Threads can have different files when created without specifying > +the CLONE_FILES flag. For this reason the interface is presented as > +/proc/<pid>/task/<tid>/dmabuf_fds and not /proc/<pid>/dmabuf_fds. > +This simplifies kernel code and aggregation can be handled in > +userspace. > + > +If a thread has the same files as its group leader, then its dmabuf_fds > +file will be empty as these dmabufs are already reported by the > +group leader. > + > Chapter 4: Configuring procfs > ============================= > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c > index 9ad6397aaa97..0660c06be4c6 100644 > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -29,8 +29,6 @@ > #include <uapi/linux/dma-buf.h> > #include <uapi/linux/magic.h> > > -static inline int is_dma_buf_file(struct file *); > - > struct dma_buf_list { > struct list_head head; > struct mutex lock; > @@ -434,10 +432,7 @@ static const struct file_operations dma_buf_fops = { > .show_fdinfo = dma_buf_show_fdinfo, > }; > > -/* > - * is_dma_buf_file - Check if struct file* is associated with dma_buf > - */ > -static inline int is_dma_buf_file(struct file *file) > +int is_dma_buf_file(struct file *file) > { > return file->f_op == &dma_buf_fops; > } > diff --git a/fs/proc/Makefile b/fs/proc/Makefile > index bd08616ed8ba..91a67f43ddf4 100644 > --- a/fs/proc/Makefile > +++ b/fs/proc/Makefile > @@ -16,6 +16,7 @@ proc-y += cmdline.o > proc-y += consoles.o > proc-y += cpuinfo.o > proc-y += devices.o > +proc-y += dma_bufs.o > proc-y += interrupts.o > proc-y += loadavg.o > proc-y += meminfo.o > diff --git a/fs/proc/base.c b/fs/proc/base.c > index b3422cda2a91..af15a60b9831 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -3598,6 +3598,7 @@ static const struct pid_entry tid_base_stuff[] = { > #ifdef CONFIG_SECCOMP_CACHE_DEBUG > ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache), > #endif > + REG("dmabuf_fds", 0444, proc_tid_dmabuf_fds_operations), > }; > > static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx) > diff --git a/fs/proc/dma_bufs.c b/fs/proc/dma_bufs.c > new file mode 100644 > index 000000000000..46ea9cf968ed > --- /dev/null > +++ b/fs/proc/dma_bufs.c > @@ -0,0 +1,159 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Per-process DMA-BUF Stats > + * > + * Copyright (C) 2021 Google LLC. > + */ > + > +#include <linux/dma-buf.h> > +#include <linux/fdtable.h> > +#include <linux/ptrace.h> > +#include <linux/seq_file.h> > + > +#include "internal.h" > + > +struct dmabuf_fds_private { > + struct inode *inode; > + struct task_struct *task; > + struct file *dmabuf_file; > +}; > + > +static loff_t *next_dmabuf(struct dmabuf_fds_private *priv, > + loff_t *pos) > +{ > + struct fdtable *fdt; > + struct file *file; > + > + rcu_read_lock(); > + fdt = files_fdtable(priv->task->files); > + for (; *pos < fdt->max_fds; ++*pos) { > + file = files_lookup_fd_rcu(priv->task->files, (unsigned int) *pos); > + if (file && is_dma_buf_file(file) && get_file_rcu(file)) { > + priv->dmabuf_file = file; > + break; > + } > + } > + if (*pos >= fdt->max_fds) > + pos = NULL; > + rcu_read_unlock(); > + > + return pos; > +} > + > +static void *dmabuf_fds_seq_start(struct seq_file *s, loff_t *pos) > +{ > + struct dmabuf_fds_private *priv = s->private; > + struct files_struct *group_leader_files; > + > + priv->task = get_proc_task(priv->inode); > + > + if (!priv->task) > + return ERR_PTR(-ESRCH); > + > + /* Hold task lock for duration that files need to be stable */ > + task_lock(priv->task); > + > + /* > + * If this task is not the group leader but shares the same files, leave file empty. > + * These dmabufs are already reported in the group leader's dmabuf_fds. > + */ > + group_leader_files = priv->task->group_leader->files; > + if (priv->task != priv->task->group_leader && priv->task->files == group_leader_files) { > + task_unlock(priv->task); > + put_task_struct(priv->task); > + priv->task = NULL; > + return NULL; > + } > + > + return next_dmabuf(priv, pos); > +} > + > +static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos) > +{ > + ++*pos; > + return next_dmabuf(s->private, pos); > +} > + > +static void dmabuf_fds_seq_stop(struct seq_file *s, void *v) > +{ > + struct dmabuf_fds_private *priv = s->private; > + > + if (priv->task) { > + task_unlock(priv->task); > + put_task_struct(priv->task); > + > + } > + if (priv->dmabuf_file) > + fput(priv->dmabuf_file); > +} > + > +static int dmabuf_fds_seq_show(struct seq_file *s, void *v) > +{ > + struct dmabuf_fds_private *priv = s->private; > + struct file *file = priv->dmabuf_file; > + struct dma_buf *dmabuf = file->private_data; > + > + if (!dmabuf) > + return -ESRCH; > + > + seq_printf(s, "%8lu ", file_inode(file)->i_ino); > + > + fput(priv->dmabuf_file); > + priv->dmabuf_file = NULL; > + > + return 0; > +} > + > +static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = { > + .start = dmabuf_fds_seq_start, > + .next = dmabuf_fds_seq_next, > + .stop = dmabuf_fds_seq_stop, > + .show = dmabuf_fds_seq_show > +}; > + > +static int proc_dmabuf_fds_open(struct inode *inode, struct file *file, > + const struct seq_operations *ops) > +{ > + struct dmabuf_fds_private *priv; > + struct task_struct *task; > + bool allowed = false; > + > + task = get_proc_task(inode); > + if (!task) > + return -ESRCH; > + > + allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); > + put_task_struct(task); > + > + if (!allowed) > + return -EACCES; > + > + priv = __seq_open_private(file, ops, sizeof(*priv)); > + if (!priv) > + return -ENOMEM; > + > + priv->inode = inode; > + priv->task = NULL; > + priv->dmabuf_file = NULL; > + > + return 0; > +} > + > +static int proc_dmabuf_fds_release(struct inode *inode, struct file *file) > +{ > + return seq_release_private(inode, file); > +} > + > +static int tid_dmabuf_fds_open(struct inode *inode, struct file *file) > +{ > + return proc_dmabuf_fds_open(inode, file, > + &proc_tid_dmabuf_fds_seq_ops); > +} > + > +const struct file_operations proc_tid_dmabuf_fds_operations = { > + .open = tid_dmabuf_fds_open, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = proc_dmabuf_fds_release, > +}; > + > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > index f60b379dcdc7..4ca74220db9c 100644 > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -303,6 +303,7 @@ extern const struct file_operations proc_pid_smaps_operations; > extern const struct file_operations proc_pid_smaps_rollup_operations; > extern const struct file_operations proc_clear_refs_operations; > extern const struct file_operations proc_pagemap_operations; > +extern const struct file_operations proc_tid_dmabuf_fds_operations; > > extern unsigned long task_vsize(struct mm_struct *); > extern unsigned long task_statm(struct mm_struct *, > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h > index cf72699cb2bc..087e11f7f193 100644 > --- a/include/linux/dma-buf.h > +++ b/include/linux/dma-buf.h > @@ -27,6 +27,11 @@ struct device; > struct dma_buf; > struct dma_buf_attachment; > > +/** > + * Check if struct file* is associated with dma_buf. > + */ > +int is_dma_buf_file(struct file *file); > + > /** > * struct dma_buf_ops - operations possible on struct dma_buf > * @vmap: [optional] creates a virtual mapping for the buffer into kernel > -- > 2.30.0.280.ga3ce27912f-goog -- Michal Hocko SUSE Labs