On Thu, 17 Jan 2019, Linus Torvalds wrote: > > So that seems to deal with mincore() in a reasonable way indeed. > > > > It doesn't unfortunately really solve the preadv2(RWF_NOWAIT), nor does it > > provide any good answer what to do about it, does it? > > As I suggested earlier in the thread, the fix for RWF_NOWAIT might be > to just move the test down to after readahead. So I've done some basic smoke testing (~2 hours of LTP+xfstests) on the kernel with the three topmost patches from https://git.kernel.org/pub/scm/linux/kernel/git/jikos/jikos.git/log/?h=pagecache-sidechannel applied (also attaching to this mail), and no obvious breakage popped up. So if noone sees any principal problem there, I'll happily submit it with proper attribution etc. Thanks, -- Jiri Kosina SUSE Labs
From cbf9381eed6766cff5b05f9d948c1d225cb3d78b Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@xxxxxxx> Date: Wed, 16 Jan 2019 20:51:31 +0100 Subject: [PATCH 1/3] Revert "Change mincore() to count "mapped" pages rather than "cached" pages" This reverts commit 574823bfab82d9d8fa47f422778043fbb4b4f50e. Another aproach (checking file access permissions in order to decide what mincore() should return in order not to leak data) will be implemented instead. Signed-off-by: Jiri Kosina <jkosina@xxxxxxx> --- mm/mincore.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 81 insertions(+), 13 deletions(-) diff --git a/mm/mincore.c b/mm/mincore.c index f0f91461a9f4..218099b5ed31 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -42,14 +42,72 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr, return 0; } -static int mincore_unmapped_range(unsigned long addr, unsigned long end, - struct mm_walk *walk) +/* + * Later we can get more picky about what "in core" means precisely. + * For now, simply check to see if the page is in the page cache, + * and is up to date; i.e. that no page-in operation would be required + * at this time if an application were to map and access this page. + */ +static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff) +{ + unsigned char present = 0; + struct page *page; + + /* + * When tmpfs swaps out a page from a file, any process mapping that + * file will not get a swp_entry_t in its pte, but rather it is like + * any other file mapping (ie. marked !present and faulted in with + * tmpfs's .fault). So swapped out tmpfs mappings are tested here. + */ +#ifdef CONFIG_SWAP + if (shmem_mapping(mapping)) { + page = find_get_entry(mapping, pgoff); + /* + * shmem/tmpfs may return swap: account for swapcache + * page too. + */ + if (xa_is_value(page)) { + swp_entry_t swp = radix_to_swp_entry(page); + page = find_get_page(swap_address_space(swp), + swp_offset(swp)); + } + } else + page = find_get_page(mapping, pgoff); +#else + page = find_get_page(mapping, pgoff); +#endif + if (page) { + present = PageUptodate(page); + put_page(page); + } + + return present; +} + +static int __mincore_unmapped_range(unsigned long addr, unsigned long end, + struct vm_area_struct *vma, unsigned char *vec) { - unsigned char *vec = walk->private; unsigned long nr = (end - addr) >> PAGE_SHIFT; + int i; - memset(vec, 0, nr); - walk->private += nr; + if (vma->vm_file) { + pgoff_t pgoff; + + pgoff = linear_page_index(vma, addr); + for (i = 0; i < nr; i++, pgoff++) + vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + } else { + for (i = 0; i < nr; i++) + vec[i] = 0; + } + return nr; +} + +static int mincore_unmapped_range(unsigned long addr, unsigned long end, + struct mm_walk *walk) +{ + walk->private += __mincore_unmapped_range(addr, end, + walk->vma, walk->private); return 0; } @@ -69,9 +127,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, goto out; } - /* We'll consider a THP page under construction to be there */ if (pmd_trans_unstable(pmd)) { - memset(vec, 1, nr); + __mincore_unmapped_range(addr, end, vma, vec); goto out; } @@ -80,17 +137,28 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, pte_t pte = *ptep; if (pte_none(pte)) - *vec = 0; + __mincore_unmapped_range(addr, addr + PAGE_SIZE, + vma, vec); else if (pte_present(pte)) *vec = 1; else { /* pte is a swap entry */ swp_entry_t entry = pte_to_swp_entry(pte); - /* - * migration or hwpoison entries are always - * uptodate - */ - *vec = !!non_swap_entry(entry); + if (non_swap_entry(entry)) { + /* + * migration or hwpoison entries are always + * uptodate + */ + *vec = 1; + } else { +#ifdef CONFIG_SWAP + *vec = mincore_page(swap_address_space(entry), + swp_offset(entry)); +#else + WARN_ON(1); + *vec = 1; +#endif + } } vec++; } -- 2.12.3
From ca02a026f40dfaebc29c29edd9c992a0ff10075e Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@xxxxxxx> Date: Wed, 16 Jan 2019 20:53:17 +0100 Subject: [PATCH 2/3] mm/mincore: make mincore() more conservative The semantics of what mincore() considers to be resident is not completely clearar, but Linux has always (since 2.3.52, which is when mincore() was initially done) treated it as "page is available in page cache". That's potentially a problem, as that [in]directly exposes meta-information about pagecache / memory mapping state even about memory not strictly belonging to the process executing the syscall, opening possibilities for sidechannel attacks. Change the semantics of mincore() so that it only reveals pagecache information for non-anonymous mappings that belog to files that the calling process could (if it tried to) successfully open for writing. Originally-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Originally-by: Dominique Martinet <asmadeus@xxxxxxxxxxxxx> Signed-off-by: Jiri Kosina <jkosina@xxxxxxx> --- mm/mincore.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/mm/mincore.c b/mm/mincore.c index 218099b5ed31..11ed7064f4eb 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -169,6 +169,13 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, return 0; } +static inline bool can_do_mincore(struct vm_area_struct *vma) +{ + return vma_is_anonymous(vma) + || (vma->vm_file && (vma->vm_file->f_mode & FMODE_WRITE)) + || inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0; +} + /* * Do a chunk of "sys_mincore()". We've already checked * all the arguments, we hold the mmap semaphore: we should @@ -189,8 +196,13 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v vma = find_vma(current->mm, addr); if (!vma || addr < vma->vm_start) return -ENOMEM; - mincore_walk.mm = vma->vm_mm; end = min(vma->vm_end, addr + (pages << PAGE_SHIFT)); + if (!can_do_mincore(vma)) { + unsigned long pages = (end - addr) >> PAGE_SHIFT; + memset(vec, 1, pages); + return pages; + } + mincore_walk.mm = vma->vm_mm; err = walk_page_range(addr, end, &mincore_walk); if (err < 0) return err; -- 2.12.3
From e7765f317afb193adb4ba00d81251686191cbf4b Mon Sep 17 00:00:00 2001 From: Jiri Kosina <jkosina@xxxxxxx> Date: Wed, 16 Jan 2019 21:06:58 +0100 Subject: [PATCH 3/3] mm/filemap: initiate readahead even if IOCB_NOWAIT is set for the I/O preadv2(RWF_NOWAIT) can be used to open a side-channel to pagecache contents, as it reveals metadata about residency of pages in pagecache. If preadv2(RWF_NOWAIT) returns immediately, it provides a clear "page not resident" information, and vice versa. Close that sidechannel by always initiating readahead on the cache if we encounter a cache miss for preadv2(RWF_NOWAIT); with that in place, probing the pagecache residency itself will actually populate the cache, making the sidechannel useless. Originally-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Signed-off-by: Jiri Kosina <jkosina@xxxxxxx> --- mm/filemap.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 9f5e323e883e..7bcdd36e629d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2075,8 +2075,6 @@ static ssize_t generic_file_buffered_read(struct kiocb *iocb, page = find_get_page(mapping, index); if (!page) { - if (iocb->ki_flags & IOCB_NOWAIT) - goto would_block; page_cache_sync_readahead(mapping, ra, filp, index, last_index - index); -- 2.12.3