On Fri, 15 Sep 2017, Joe Lawrence wrote: > On 09/14/2017 07:09 PM, Mikulas Patocka wrote: > > On Tue, 5 Sep 2017, Joe Lawrence wrote: > >> pipe_max_size is assigned directly via procfs sysctl: > >> > >> static struct ctl_table fs_table[] = { > >> ... > >> { > >> .procname = "pipe-max-size", > >> .data = &pipe_max_size, > >> .maxlen = sizeof(int), > >> .mode = 0644, > >> .proc_handler = &pipe_proc_fn, > >> .extra1 = &pipe_min_size, > >> }, > >> ... > >> > >> int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf, > >> size_t *lenp, loff_t *ppos) > >> { > >> ... > >> ret = proc_dointvec_minmax(table, write, buf, lenp, ppos) > >> ... > >> > >> and then later rounded in-place a few statements later: > >> > >> ... > >> pipe_max_size = round_pipe_size(pipe_max_size); > >> ... > >> > >> This leaves a window of time between initial assignment and rounding > >> that may be visible to other threads. (For example, one thread sets a > >> non-rounded value to pipe_max_size while another reads its value.) > >> > >> Similar reads of pipe_max_size are potentially racey: > >> > >> pipe.c :: alloc_pipe_info() > >> pipe.c :: pipe_set_size() > >> > >> Protect them and the procfs sysctl assignment with a mutex. > >> > >> Reported-by: Mikulas Patocka <mpatocka@xxxxxxxxxx> > >> Signed-off-by: Joe Lawrence <joe.lawrence@xxxxxxxxxx> > >> --- > >> fs/pipe.c | 28 ++++++++++++++++++++++++---- > >> 1 file changed, 24 insertions(+), 4 deletions(-) > >> > >> diff --git a/fs/pipe.c b/fs/pipe.c > >> index fa28910b3c59..33bb11b0d78e 100644 > >> --- a/fs/pipe.c > >> +++ b/fs/pipe.c > >> @@ -35,6 +35,11 @@ > >> unsigned int pipe_max_size = 1048576; > >> > >> /* > >> + * Provide mutual exclusion around access to pipe_max_size > >> + */ > >> +static DEFINE_MUTEX(pipe_max_mutex); > >> + > >> +/* > >> * Minimum pipe size, as required by POSIX > >> */ > >> unsigned int pipe_min_size = PAGE_SIZE; > >> @@ -623,13 +628,18 @@ struct pipe_inode_info *alloc_pipe_info(void) > >> unsigned long pipe_bufs = PIPE_DEF_BUFFERS; > >> struct user_struct *user = get_current_user(); > >> unsigned long user_bufs; > >> + unsigned int max_size; > >> > >> pipe = kzalloc(sizeof(struct pipe_inode_info), GFP_KERNEL_ACCOUNT); > >> if (pipe == NULL) > >> goto out_free_uid; > >> > >> - if (pipe_bufs * PAGE_SIZE > pipe_max_size && !capable(CAP_SYS_RESOURCE)) > >> - pipe_bufs = pipe_max_size >> PAGE_SHIFT; > >> + mutex_lock(&pipe_max_mutex); > >> + max_size = pipe_max_size; > >> + mutex_unlock(&pipe_max_mutex); > >> + > >> + if (pipe_bufs * PAGE_SIZE > max_size && !capable(CAP_SYS_RESOURCE)) > >> + pipe_bufs = max_size >> PAGE_SHIFT; > >> > >> user_bufs = account_pipe_buffers(user, 0, pipe_bufs); > >> > >> @@ -1039,6 +1049,7 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > >> struct pipe_buffer *bufs; > >> unsigned int size, nr_pages; > >> unsigned long user_bufs; > >> + unsigned int max_size; > >> long ret = 0; > >> > >> size = round_pipe_size(arg); > >> @@ -1056,8 +1067,11 @@ static long pipe_set_size(struct pipe_inode_info *pipe, unsigned long arg) > >> * Decreasing the pipe capacity is always permitted, even > >> * if the user is currently over a limit. > >> */ > >> + mutex_lock(&pipe_max_mutex); > >> + max_size = pipe_max_size; > >> + mutex_unlock(&pipe_max_mutex); > >> if (nr_pages > pipe->buffers && > >> - size > pipe_max_size && !capable(CAP_SYS_RESOURCE)) > >> + size > max_size && !capable(CAP_SYS_RESOURCE)) > >> return -EPERM; > >> > >> user_bufs = account_pipe_buffers(pipe->user, pipe->buffers, nr_pages); > >> @@ -1131,18 +1145,24 @@ int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf, > >> unsigned int rounded_pipe_max_size; > >> int ret; > >> > >> + mutex_lock(&pipe_max_mutex); > >> orig_pipe_max_size = pipe_max_size; > >> ret = proc_dointvec_minmax(table, write, buf, lenp, ppos); > >> - if (ret < 0 || !write) > >> + if (ret < 0 || !write) { > >> + mutex_unlock(&pipe_max_mutex); > >> return ret; > >> + } > >> > >> rounded_pipe_max_size = round_pipe_size(pipe_max_size); > >> if (rounded_pipe_max_size == 0) { > >> pipe_max_size = orig_pipe_max_size; > >> + mutex_unlock(&pipe_max_mutex); > >> return -EINVAL; > >> } > >> > >> pipe_max_size = rounded_pipe_max_size; > >> + mutex_unlock(&pipe_max_mutex); > >> + > >> return ret; > >> } > >> > >> -- > >> 1.8.3.1 > >> > > I think this mutex is too heavy - if multiple processes simultaneously > > create a pipe, the mutex would cause performance degradation. > > > > You can call do_proc_dointvec with a custom callback "conv" function that > > does the rounding of the pipe size value. > > > > Mikulas > > > > Hi Mikulas, > > I'm not strong when it comes to memory barriers, but one of the > side-effects of using the mutex is that pipe_set_size() and > alloc_pipe_info() should have a consistent view of pipe_max_size. > > If I remove the mutex (and assume that I implement a custom > do_proc_dointvec "conv" callback), is it safe for these routines to > directly use pipe_max_size as they had done before? > > If not, is it safe to alias through a temporary stack variable (ie, > could the compiler re-read pipe_max_size multiple times in the function)? > > Would READ_ONCE() help in any way? Theoretically re-reading the variable is possible and you should use ACCESS_ONCE or READ_ONCE+WRITE_ONCE on that variable. In practice, ACCESS_ONCE/READ_ONCE/WRITE_ONCE is missing at a lot of kernel variables that could be modified asynchronously and no one is complaining about it and no one is making any systematic effort to fix it. That re-reading happens (I have some test code that makes the gcc optimizer re-read a variable), but it happens very rarely. Another theoretical problem is that when reading or writing a variable without ACCESS_ONCE, the compiler could read and write the variable using multiple smaller memory accesses. But in practice, it happens only on some non-common architectures. > The mutex covered up some confusion on my part here. > > OTOH, since pipe_max_size is read-only for pipe_set_size() and > alloc_pipe_info() and only updated occasionally by pipe_proc_fn(), would > rw_semaphore or RCU be a fit here? RW semaphore causes cache-line ping-pong between CPUs, it slows down the kernel just like a normal spinlock or mutex. RCU would be useless here (i.e. you don't want to allocate memory and atomically assign it with rcu_assign_pointer). > Regards, > > -- Joe Mikulas