On Wed, May 29, 2019 at 12:28 AM David Howells <dhowells@xxxxxxxxxx> wrote: > Jann Horn <jannh@xxxxxxxxxx> wrote: > > I don't see you setting any special properties on the VMA that would > > prevent userspace from extending its size via mremap() - no > > VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds > > access here? > > Should I just set VM_DONTEXPAND in watch_queue_mmap()? Like so: > > vma->vm_flags |= VM_DONTEXPAND; Yeah, that should work. > > > +static void watch_queue_map_pages(struct vm_fault *vmf, > > > + pgoff_t start_pgoff, pgoff_t end_pgoff) > > ... > > > > Same as above. > > Same solution as above? Or do I need ot check start/end_pgoff too? Same solution. > > > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma) > > > +{ > > > + struct watch_queue *wqueue = file->private_data; > > > + > > > + if (vma->vm_pgoff != 0 || > > > + vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE || > > > + !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED))) > > > + return -EINVAL; > > > > This thing should probably have locking against concurrent > > watch_queue_set_size()? > > Yeah. > > static int watch_queue_mmap(struct file *file, > struct vm_area_struct *vma) > { > struct watch_queue *wqueue = file->private_data; > struct inode *inode = file_inode(file); > u8 nr_pages; > > inode_lock(inode); > nr_pages = wqueue->nr_pages; > inode_unlock(inode); > > if (nr_pages == 0 || > ... > return -EINVAL; Looks reasonable. > > > + smp_store_release(&buf->meta.head, len); > > > > Why is this an smp_store_release()? The entire buffer should not be visible to > > userspace before this setup is complete, right? > > Yes - if I put the above locking in the mmap handler. OTOH, it's a one-off > barrier... > > > > + if (wqueue->buffer) > > > + return -EBUSY; > > > > The preceding check occurs without any locks held and therefore has no > > reliable effect. It should probably be moved below the > > inode_lock(...). > > Yeah, it can race. I'll move it into watch_queue_set_size(). Sounds good. > > > +static void free_watch(struct rcu_head *rcu) > > > +{ > > > + struct watch *watch = container_of(rcu, struct watch, rcu); > > > + > > > + put_watch_queue(rcu_access_pointer(watch->queue)); > > > > This should be rcu_dereference_protected(..., 1). > > That shouldn't be necessary. rcu_access_pointer()'s description says: > > * It is also permissible to use rcu_access_pointer() when read-side > * access to the pointer was removed at least one grace period ago, as > * is the case in the context of the RCU callback that is freeing up > * the data, ... > > It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd > struct should be fine with rcu_access_pointer(). Aah, whoops, you're right, I missed that paragraph in the documentation of rcu_access_pointer(). > > > + /* We don't need the watch list lock for the next bit as RCU is > > > + * protecting everything from being deallocated. > > > > Does "everything" mean "the wqueue" or more than that? > > Actually, just 'wqueue' and its buffer. 'watch' is held by us once we've > dequeued it as we now own the ref 'wlist' had on it. 'wlist' and 'wq' must be > pinned by the caller. > > > > + if (release) { > > > + if (wlist->release_watch) { > > > + rcu_read_unlock(); > > > + /* This might need to call dput(), so > > > + * we have to drop all the locks. > > > + */ > > > + wlist->release_watch(wlist, watch); > > > > How are you holding a reference to `wlist` here? You got the reference through > > rcu_dereference(), you've dropped the RCU read lock, and I don't see anything > > that stabilizes the reference. > > The watch record must hold a ref on the watched object if the watch_list has a > ->release_watch() method. In the code snippet above, the watch record now > belongs to us because we unlinked it under the wlist->lock some lines prior. Ah, of course. > However, you raise a good point, and I think the thing to do is to cache > ->release_watch from it and not pass wlist into (*release_watch)(). We don't > need to concern ourselves with cleaning up *wlist as it will be cleaned up > when the target object is removed. > > Keyrings don't have a ->release_watch method and neither does the block-layer > notification stuff. > > > > + if (wqueue->pages && wqueue->pages[0]) > > > + WARN_ON(page_ref_count(wqueue->pages[0]) != 1); > > > > Is there a reason why there couldn't still be references to the pages > > from get_user_pages()/get_user_pages_fast()? > > I'm not sure. I'm not sure what to do if there are. What do you suggest? I would use put_page() instead of manually freeing it; I think that should be enough? I'm not entirely sure though. > > > + n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID); > > > > Should the non-atomic modification of n->info > > n's an unpublished copy of some userspace data that's local to the function > instance. There shouldn't be any way to race with it at this point. Ah, derp. Yes.