On Sun, Jan 06, 2019 at 01:46:37PM -0800, Linus Torvalds wrote:
> On Sat, Jan 5, 2019 at 5:50 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Slightly updated patch in case somebody wants to try things out.
>
> I decided to just apply that patch. It is *not* marked for stable,
> very intentionally, because I expect that we will need to wait and see
> if there are issues with it, and whether we might have to do something
> entirely different (more like the traditional behavior with some extra
> "only for owner" logic).

So, I read the paper and before I was halfway through it I figured
there are a bunch of other similar page cache invalidation attacks
we can perform without needing mincore. i.e. Focussing on mmap() and
mincore() misses the wider issues we have with global shared caches.

My first thought:

	fd = open(some_file, O_RDONLY);
	iov[0].iov_base = buf;
	iov[0].iov_len = 1;

	ret = preadv2(fd, iov, 1, off, RWF_NOWAIT);
	/* userspace sees -1 with errno set, so fold errno into the value */
	switch (ret < 0 ? -errno : ret) {
	case -EAGAIN:
		/* page not resident in page cache */
		break;
	case 1:
		/* page resident in page cache */
		break;
	default:
		/* beyond EOF or some other error */
		break;
	}

This is "as documented" in the man page for preadv2:

	RWF_NOWAIT (since Linux 4.14)
	      Do not wait for data which is not immediately available.
	      If this flag is specified, the preadv2() system call will
	      return instantly if it would have to read data from the
	      backing storage or wait for a lock. If some data was
	      successfully read, it will return the number of bytes
	      read. If no bytes were read, it will return -1 and set
	      errno to EAGAIN. Currently, this flag is meaningful only
	      for preadv2().

IOWs, we've explicitly designed interfaces to communicate whether
data is "immediately accessible" or not to the application so they
can make sensible decisions about IO scheduling. i.e. IO won't block
the main application processing loop and so can be scheduled in the
background by the app and the data processed when that IO returns.

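For anyone who wants to try the preadv2() probe, here is a minimal sketch of it as a standalone helper. The name probe_resident() and its return convention are made up for illustration. It assumes a libc that exposes preadv2() and RWF_NOWAIT under _GNU_SOURCE, and a filesystem that supports RWF_NOWAIT (otherwise the call fails with something like EOPNOTSUPP and the helper returns -1). Per the man page text quoted above, the "would block" case shows up in userspace as a return of -1 with errno set to EAGAIN.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): try to read one byte at
 * offset `off` of `path` without blocking.
 *
 * Returns  1 if the byte was read without waiting (page is cached),
 *          0 if the kernel reported EAGAIN (it would have had to wait),
 *         -1 for EOF, an unsupported flag, or any other error.
 */
static int probe_resident(const char *path, off_t off)
{
	char buf;
	struct iovec iov = { .iov_base = &buf, .iov_len = 1 };
	ssize_t ret;
	int result;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
	if (ret == 1)
		result = 1;	/* served straight from the page cache */
	else if (ret < 0 && errno == EAGAIN)
		result = 0;	/* would need backing storage or a lock */
	else
		result = -1;	/* beyond EOF or some other error */

	close(fd);
	return result;
}
```

Calling probe_resident(some_file, off) at each page-aligned offset gives roughly the same residency picture that mincore() would, without mapping the file at all.
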
That just so happens to be exactly the same information about the
page cache that mincore is making available to userspace.

If we "remove" this information from the interfaces like it has been
done for mincore(), it doesn't mean userspace can't get it in other
ways. e.g. it now just has to time the read(2) syscall duration and
infer whether the data came from the page cache or disk from the
timing information.

IMO, there's nothing new or novel about this page cache information
leak - it was just a matter of time before some researcher put 2 and
2 together and realised that sharing the page cache across a
security boundary is little different from sharing deduplicated
pages across those same security boundaries. i.e. As long as we
share caches across security boundaries and userspace can control
both cache invalidation and instantiation, we cannot prevent
userspace from constructing these invalidate+read information
exfiltration scenarios.

And then there is overlayfs. Overlay is really just a way to
efficiently share the page cache of the same underlying read-only
directory tree across all containers on a host. i.e. we have been
specifically designing our container hosting systems to share the
underlying read-only page cache across all security boundaries on
the host. If overlay is vulnerable to these shared page cache
attacks (seems extremely likely) then we've got much bigger problems
than mincore to think about....

> But doing a test patch during the merge window (which is about to
> close) sounds like the right thing to do.

IMO it seems like the wrong thing to do. It's just a hacky band-aid
over a specific extraction method and does nothing to reduce the
actual scope of the information leak. Focussing on the band-aid
means you've missed all the other avenues through which the same
information is exposed and all the infrastructure we've built on the
core concept of sharing kernel side pages across security
boundaries.

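To make the read(2) timing fallback mentioned earlier concrete, here is a rough sketch. The helper name timed_read_ns() is hypothetical, and so is whatever threshold a caller would compare the result against; that would have to be calibrated per machine and storage device. Note that, unlike the RWF_NOWAIT probe, this one is destructive: the read itself pulls the page into the cache.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): time a one-byte pread() at
 * offset `off` of `path`.
 *
 * Returns the elapsed time in nanoseconds, or -1 on error or EOF.
 * A short duration suggests the data came from the page cache, a long
 * one suggests it came from disk.
 */
static int64_t timed_read_ns(const char *path, off_t off)
{
	struct timespec t0, t1;
	char buf;
	ssize_t ret;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ret = pread(fd, &buf, 1, off);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	close(fd);

	if (ret != 1)
		return -1;

	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

A single sample is noisy, so a real measurement would presumably repeat the invalidate+read cycle and look at the distribution rather than one value.
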
And that's even without considering whether the change breaks
userspace. Which it does. e.g. vmtouch is fairly widely used to
manage page cache instantiation for rapid bring-up and migration of
guest VMs and containers. They save the hot page cache information
from a running container and then use that to instantiate the page
cache in new instances running the same workload so they run at full
speed right from the start. This use case calls mincore() to pull
the page cache information from the running container.

If anyone else proposed merging a syscall implementation change that
was extremely likely to break userspace you'd be shouting at them
that "we don't break userspace"....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx