On 2019/09/10 11:00, Kirill A. Shutemov wrote: > On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote: >> There are several storage drivers like dm-multipath, iscsi, and nbd that >> have userspace components that can run in the IO path. For example, >> iscsi and nbd's userspace deamons may need to recreate a socket and/or >> send IO on it, and dm-multipath's daemon multipathd may need to send IO >> to figure out the state of paths and re-set them up. >> >> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the >> memalloc_*_save/restore functions to control the allocation behavior, >> but for userspace we would end up hitting a allocation that ended up >> writing data back to the same device we are trying to allocate for. >> >> This patch allows the userspace deamon to set the PF_MEMALLOC* flags >> through procfs. It currently only supports PF_MEMALLOC_NOIO, but >> depending on what other drivers and userspace file systems need, for >> the final version I can add the other flags for that file or do a file >> per flag or just do a memalloc_noio file. >> >> Signed-off-by: Mike Christie <mchristi@xxxxxxxxxx> >> --- >> Documentation/filesystems/proc.txt | 6 ++++ >> fs/proc/base.c | 53 ++++++++++++++++++++++++++++++ >> 2 files changed, 59 insertions(+) >> >> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt >> index 99ca040e3f90..b5456a61a013 100644 >> --- a/Documentation/filesystems/proc.txt >> +++ b/Documentation/filesystems/proc.txt >> @@ -46,6 +46,7 @@ Table of Contents >> 3.10 /proc/<pid>/timerslack_ns - Task timerslack value >> 3.11 /proc/<pid>/patch_state - Livepatch patch operation state >> 3.12 /proc/<pid>/arch_status - Task architecture specific information >> + 3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior >> >> 4 Configuring procfs >> 4.1 Mount options >> @@ -1980,6 +1981,11 @@ Example >> $ cat /proc/6753/arch_status >> AVX512_elapsed_ms: 8 >> >> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior >> +----------------------------------------------------------------------- >> +A value of "noio" indicates that when a task allocates memory it will not >> +reclaim memory that requires starting phisical IO. >> + >> Description >> ----------- >> >> diff --git a/fs/proc/base.c b/fs/proc/base.c >> index ebea9501afb8..c4faa3464602 100644 >> --- a/fs/proc/base.c >> +++ b/fs/proc/base.c >> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = { >> .llseek = default_llseek, >> }; >> >> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count, >> + loff_t *ppos) >> +{ >> + struct task_struct *task; >> + ssize_t rc = 0; >> + >> + task = get_proc_task(file_inode(file)); >> + if (!task) >> + return -ESRCH; >> + >> + if (task->flags & PF_MEMALLOC_NOIO) >> + rc = simple_read_from_buffer(buf, count, ppos, "noio", 4); >> + put_task_struct(task); >> + return rc; >> +} >> + >> +static ssize_t memalloc_write(struct file *file, const char __user *buf, >> + size_t count, loff_t *ppos) >> +{ >> + struct task_struct *task; >> + char buffer[5]; >> + int rc = count; >> + >> + memset(buffer, 0, sizeof(buffer)); >> + if (count != sizeof(buffer) - 1) >> + return -EINVAL; >> + >> + if (copy_from_user(buffer, buf, count)) >> + return -EFAULT; >> + buffer[count] = '\0'; >> + >> + task = get_proc_task(file_inode(file)); >> + if (!task) >> + return -ESRCH; >> + >> + if (!strcmp(buffer, "noio")) { >> + task->flags |= PF_MEMALLOC_NOIO; >> + } else { >> + rc = -EINVAL; >> + } > > Really? Without any privilege check? So any random user can tap into > __GFP_NOIO allocations? OK. It probably should have a test on capable(CAP_SYS_ADMIN) or similar. Since these storage daemons are generally run as root anyway, that would still work for most setup I think. > > NAK. > > I don't think that it's great idea in general to expose this low-level > machinery to userspace. But it's better to get comment from people move > familiar with reclaim path. Any setup with stacked file systems and one of the IO path component being a user level process can benefit from this. See the problem described in this patch I pushed for (unsuccessfully as it was a heavy handed solution): https://www.spinics.net/lists/linux-fsdevel/msg148912.html As the discussion in this thread shows, there is no existing simple solution to deal with this reclaim recursion problem. And automatic detection is too hard, if at all possible. With the proper access rights added, this user accessible interface does look very sensible to me. Best regards. -- Damien Le Moal Western Digital Research