On Fri, Dec 20, 2013 at 4:53 AM, Benjamin LaHaise <bcrl@xxxxxxxxx> wrote:
>
> Yes, that's what I found when I started looking into this in detail again.
> I think the page reference counting is actually correct.  There are 2
> references on each page: the first is from the find_or_create_page() call,
> and the second is from the get_user_pages() (which also makes sure the page
> is populated into the page tables).

Ok, I'm sorry, but that's just pure bullshit then.

So it has the page array in the page cache, then mmap's it in, and uses
get_user_pages() to get the pages back that it *just* created.

This code is pure and utter garbage.  It's beyond the pale how crazy it is.

Why not just get rid of the idiotic get_user_pages() crap then?  Something
like the attached patch?  Totally untested, but at least it makes *some*
amount of sense.

                Linus
 fs/aio.c | 20 +++-----------------
 1 file changed, 3 insertions(+), 17 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 6efb7f6cb22e..e1b02dd1be9e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -358,6 +358,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 		SetPageUptodate(page);
 		SetPageDirty(page);
 		unlock_page(page);
+
+		ctx->ring_pages[i] = page;
 	}
 	ctx->aio_ring_file = file;
 	nr_events = (PAGE_SIZE * nr_pages - sizeof(struct aio_ring))
@@ -380,8 +382,8 @@ static int aio_setup_ring(struct kioctx *ctx)
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
 				       MAP_SHARED | MAP_POPULATE, 0, &populate);
+	up_write(&mm->mmap_sem);
 	if (IS_ERR((void *)ctx->mmap_base)) {
-		up_write(&mm->mmap_sem);
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
 		return -EAGAIN;
@@ -389,22 +391,6 @@ static int aio_setup_ring(struct kioctx *ctx)
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
 
-	/* We must do this while still holding mmap_sem for write, as we
-	 * need to be protected against userspace attempting to mremap()
-	 * or munmap() the ring buffer.
-	 */
-	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
-				       1, 0, ctx->ring_pages, NULL);
-
-	/* Dropping the reference here is safe as the page cache will hold
-	 * onto the pages for us.  It is also required so that page migration
-	 * can unmap the pages and get the right reference count.
-	 */
-	for (i = 0; i < ctx->nr_pages; i++)
-		put_page(ctx->ring_pages[i]);
-
-	up_write(&mm->mmap_sem);
-
 	if (unlikely(ctx->nr_pages != nr_pages)) {
 		aio_free_ring(ctx);
 		return -EAGAIN;
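
[Editor's note: to make the refcount claim quoted at the top concrete, here is a
heavily condensed sketch of the pre-patch aio_setup_ring() flow that the removed
hunk corresponds to.  Error handling, the pr_debug calls and the mapping lookup
are elided or simplified ("mapping" here stands for the ring file's
address_space); this is an illustration of where the two references per page
come from, not the actual file contents.]

	/* Sketch only -- simplified from the pre-patch code the hunk above removes. */
	for (i = 0; i < nr_pages; i++) {
		struct page *page;

		/* Reference #1: the page lives in the ring file's page cache;
		 * find_or_create_page() returns it locked with its refcount
		 * elevated, and the page cache keeps holding it. */
		page = find_or_create_page(mapping, i, GFP_HIGHUSER | __GFP_ZERO);
		SetPageUptodate(page);
		SetPageDirty(page);
		unlock_page(page);
	}

	/* ... the ring file is then mmap'ed in with do_mmap_pgoff() ... */

	/* Reference #2: get_user_pages() walks the page tables of the mapping
	 * that was just created and takes another reference on the very same
	 * pages, purely to fill ctx->ring_pages[]. */
	ctx->nr_pages = get_user_pages(current, mm, ctx->mmap_base, nr_pages,
				       1, 0, ctx->ring_pages, NULL);

	/* ... which is dropped again right away, so only the page-cache
	 * reference survives.  The patch above fills ctx->ring_pages[]
	 * straight from the find_or_create_page() loop instead, making the
	 * get_user_pages()/put_page() round trip unnecessary. */
	for (i = 0; i < ctx->nr_pages; i++)
		put_page(ctx->ring_pages[i]);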