Re: [RFC PATCH] Add proc interface to set PF_MEMALLOC flags

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Sep 10, 2019 at 12:05:33PM +0000, Damien Le Moal wrote:
> On 2019/09/10 11:00, Kirill A. Shutemov wrote:
> > On Mon, Sep 09, 2019 at 11:28:04AM -0500, Mike Christie wrote:
> >> There are several storage drivers like dm-multipath, iscsi, and nbd that
> >> have userspace components that can run in the IO path. For example,
> >> iscsi and nbd's userspace deamons may need to recreate a socket and/or
> >> send IO on it, and dm-multipath's daemon multipathd may need to send IO
> >> to figure out the state of paths and re-set them up.
> >>
> >> In the kernel these drivers have access to GFP_NOIO/GFP_NOFS and the
> >> memalloc_*_save/restore functions to control the allocation behavior,
> >> but for userspace we would end up hitting a allocation that ended up
> >> writing data back to the same device we are trying to allocate for.
> >>
> >> This patch allows the userspace deamon to set the PF_MEMALLOC* flags
> >> through procfs. It currently only supports PF_MEMALLOC_NOIO, but
> >> depending on what other drivers and userspace file systems need, for
> >> the final version I can add the other flags for that file or do a file
> >> per flag or just do a memalloc_noio file.
> >>
> >> Signed-off-by: Mike Christie <mchristi@xxxxxxxxxx>
> >> ---
> >>  Documentation/filesystems/proc.txt |  6 ++++
> >>  fs/proc/base.c                     | 53 ++++++++++++++++++++++++++++++
> >>  2 files changed, 59 insertions(+)
> >>
> >> diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
> >> index 99ca040e3f90..b5456a61a013 100644
> >> --- a/Documentation/filesystems/proc.txt
> >> +++ b/Documentation/filesystems/proc.txt
> >> @@ -46,6 +46,7 @@ Table of Contents
> >>    3.10  /proc/<pid>/timerslack_ns - Task timerslack value
> >>    3.11	/proc/<pid>/patch_state - Livepatch patch operation state
> >>    3.12	/proc/<pid>/arch_status - Task architecture specific information
> >> +  3.13  /proc/<pid>/memalloc - Control task's memory reclaim behavior
> >>  
> >>    4	Configuring procfs
> >>    4.1	Mount options
> >> @@ -1980,6 +1981,11 @@ Example
> >>   $ cat /proc/6753/arch_status
> >>   AVX512_elapsed_ms:      8
> >>  
> >> +3.13 /proc/<pid>/memalloc - Control task's memory reclaim behavior
> >> +-----------------------------------------------------------------------
> >> +A value of "noio" indicates that when a task allocates memory it will not
> >> +reclaim memory that requires starting phisical IO.
> >> +
> >>  Description
> >>  -----------
> >>  
> >> diff --git a/fs/proc/base.c b/fs/proc/base.c
> >> index ebea9501afb8..c4faa3464602 100644
> >> --- a/fs/proc/base.c
> >> +++ b/fs/proc/base.c
> >> @@ -1223,6 +1223,57 @@ static const struct file_operations proc_oom_score_adj_operations = {
> >>  	.llseek		= default_llseek,
> >>  };
> >>  
> >> +static ssize_t memalloc_read(struct file *file, char __user *buf, size_t count,
> >> +			     loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	ssize_t rc = 0;
> >> +
> >> +	task = get_proc_task(file_inode(file));
> >> +	if (!task)
> >> +		return -ESRCH;
> >> +
> >> +	if (task->flags & PF_MEMALLOC_NOIO)
> >> +		rc = simple_read_from_buffer(buf, count, ppos, "noio", 4);
> >> +	put_task_struct(task);
> >> +	return rc;
> >> +}
> >> +
> >> +static ssize_t memalloc_write(struct file *file, const char __user *buf,
> >> +			      size_t count, loff_t *ppos)
> >> +{
> >> +	struct task_struct *task;
> >> +	char buffer[5];
> >> +	int rc = count;
> >> +
> >> +	memset(buffer, 0, sizeof(buffer));
> >> +	if (count != sizeof(buffer) - 1)
> >> +		return -EINVAL;
> >> +
> >> +	if (copy_from_user(buffer, buf, count))
> >> +		return -EFAULT;
> >> +	buffer[count] = '\0';
> >> +
> >> +	task = get_proc_task(file_inode(file));
> >> +	if (!task)
> >> +		return -ESRCH;
> >> +
> >> +	if (!strcmp(buffer, "noio")) {
> >> +		task->flags |= PF_MEMALLOC_NOIO;
> >> +	} else {
> >> +		rc = -EINVAL;
> >> +	}
> > 
> > Really? Without any privilege check? So any random user can tap into
> > __GFP_NOIO allocations?
> 
> OK. It probably should have a test on capable(CAP_SYS_ADMIN) or similar. Since
> these storage daemons are generally run as root anyway, that would still work
> for most setup I think.
> 
> > 
> > NAK.
> > 
> > I don't think that it's great idea in general to expose this low-level
> > machinery to userspace. But it's better to get comment from people move
> > familiar with reclaim path.
> 
> Any setup with stacked file systems and one of the IO path component being a
> user level process can benefit from this. See the problem described in this
> patch I pushed for (unsuccessfully as it was a heavy handed solution):
> https://www.spinics.net/lists/linux-fsdevel/msg148912.html
> 
> As the discussion in this thread shows, there is no existing simple solution to
> deal with this reclaim recursion problem. And automatic detection is too hard,
> if at all possible. With the proper access rights added, this user accessible
> interface does look very sensible to me.

Looking into the thread, have you find out if there's anything on FUSE
side that helps it to avoid deadlocks? Or FUSE just relies on luck with
this?

-- 
 Kirill A. Shutemov



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux