On Mon, Oct 21, 2019 at 11:24:27AM +0200, Marco Elver wrote: > On Sun, 20 Oct 2019 at 19:30, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote: > > > > (euid, egid) pair is snapshotted correctly from task under RCU, > > but writeback to inode can be done in any order. > > > > Fix by doing writeback under inode->i_lock where necessary > > (/proc/* , /proc/*/fd/* , /proc/*/map_files/* revalidate). > > > > Reported-by: syzbot+e392f8008a294fdf8891@xxxxxxxxxxxxxxxxxxxxxxxxx > > Signed-off-by: Alexey Dobriyan <adobriyan@xxxxxxxxx> > > --- > > Thanks! > > This certainly fixes the problem of inconsistent uid/gid pair due to > racing writebacks, as well as the data-race. If that is the only > purpose of this patch, then from what I see this is fine: > > Acked-by: Marco Elver <elver@xxxxxxxxxx> > > However, there is probably still a more fundamental problem as outlined below. > > > fs/proc/base.c | 25 +++++++++++++++++++++++-- > > fs/proc/fd.c | 2 +- > > fs/proc/internal.h | 2 ++ > > 3 files changed, 26 insertions(+), 3 deletions(-) > > > > --- a/fs/proc/base.c > > +++ b/fs/proc/base.c > > @@ -1743,6 +1743,25 @@ void task_dump_owner(struct task_struct *task, umode_t mode, > > *rgid = gid; > > } > > > > +/* use if inode is live */ > > +void task_dump_owner_to_inode(struct task_struct *task, umode_t mode, > > + struct inode *inode) > > +{ > > + kuid_t uid; > > + kgid_t gid; > > + > > + task_dump_owner(task, mode, &uid, &gid); > > + /* > > + * There is no atomic "change all credentials at once" system call, > > + * guaranteeing more than _some_ snapshot from "struct cred" ends up > > + * in inode is not possible. > > + */ > > + spin_lock(&inode->i_lock); > > + inode->i_uid = uid; > > + inode->i_gid = gid; > > + spin_unlock(&inode->i_lock); > > 2 tasks can still race here, and the inconsistent scenario I outlined in > https://lore.kernel.org/linux-fsdevel/000000000000328b2905951a7667@xxxxxxxxxx/ > could still happen I think (although extremely unlikely). Mainly, > causality may still be violated -- but I may be wrong as I don't know > the rest of the code (so please be critical of my suggestion). > > The problem is that if 2 threads race here, one has snapshotted old > uid/gid, and the other the new uid/gid. Then it is still possible for > the old uid/gid to be written back after new uid/gid, which would > result in this bad scenario: > > === TASK 1 === > | seteuid(1000); > | seteuid(0); > | stat("/proc/<pid-of-task-1>", &fstat); > | assert(fstat.st_uid == 0); // fails > === TASK 2 === > | stat("/proc/<pid-of-task-1>", ...); > > AFAIK it's not something that can easily be fixed without some > timestamp on the uid/gid pair (timestamp updated after setuid/seteuid > etc) obtained in the RCU reader critical section. Then in this > critical section, uid/gid should only be written if the current pair > in inode is older according to snapshot timestamp. This probably requires bloating "struct cred" with generation number. I'm not sure what to do other than cc our credential overlords. > > +} > > + > > struct inode *proc_pid_make_inode(struct super_block * sb, > > struct task_struct *task, umode_t mode) > > { > > @@ -1769,6 +1788,7 @@ struct inode *proc_pid_make_inode(struct super_block * sb, > > if (!ei->pid) > > goto out_unlock; > > > > + /* fresh inode -- no races */ > > task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid); > > security_task_to_inode(task, inode); > > > > @@ -1802,6 +1822,7 @@ int pid_getattr(const struct path *path, struct kstat *stat, > > */ > > return -ENOENT; > > } > > + /* "struct kstat" is thread local, atomic snapshot is enough */ > > task_dump_owner(task, inode->i_mode, &stat->uid, &stat->gid); > > } > > rcu_read_unlock(); > > @@ -1815,7 +1836,7 @@ int pid_getattr(const struct path *path, struct kstat *stat, > > */ > > void pid_update_inode(struct task_struct *task, struct inode *inode) > > { > > - task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid); > > + task_dump_owner_to_inode(task, inode->i_mode, inode); > > > > inode->i_mode &= ~(S_ISUID | S_ISGID); > > security_task_to_inode(task, inode); > > @@ -1990,7 +2011,7 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags) > > mmput(mm); > > > > if (exact_vma_exists) { > > - task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid); > > + task_dump_owner_to_inode(task, 0, inode); > > > > security_task_to_inode(task, inode); > > status = 1; > > --- a/fs/proc/fd.c > > +++ b/fs/proc/fd.c > > @@ -101,7 +101,7 @@ static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode) > > static void tid_fd_update_inode(struct task_struct *task, struct inode *inode, > > fmode_t f_mode) > > { > > - task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid); > > + task_dump_owner_to_inode(task, 0, inode); > > > > if (S_ISLNK(inode->i_mode)) { > > unsigned i_mode = S_IFLNK; > > --- a/fs/proc/internal.h > > +++ b/fs/proc/internal.h > > @@ -123,6 +123,8 @@ static inline struct task_struct *get_proc_task(const struct inode *inode) > > > > void task_dump_owner(struct task_struct *task, umode_t mode, > > kuid_t *ruid, kgid_t *rgid); > > +void task_dump_owner_to_inode(struct task_struct *task, umode_t mode, > > + struct inode *inode); > > > > unsigned name_to_int(const struct qstr *qstr); > > /*