On Tue, 2020-09-29 at 16:06 +0300, Mike Rapoport wrote:
> On Tue, Sep 29, 2020 at 04:58:44AM +0000, Edgecombe, Rick P wrote:
> > On Thu, 2020-09-24 at 16:29 +0300, Mike Rapoport wrote:
> > > Introduce "memfd_secret" system call with the ability to create memory
> > > areas visible only in the context of the owning process and not mapped not
> > > only to other processes but in the kernel page tables as well.
> > >
> > > The user will create a file descriptor using the memfd_secret() system call
> > > where flags supplied as a parameter to this system call will define the
> > > desired protection mode for the memory associated with that file
> > > descriptor.
> > >
> > > Currently there are two protection modes:
> > >
> > > * exclusive - the memory area is unmapped from the kernel direct map and it
> > >               is present only in the page tables of the owning mm.
> >
> > Seems like there were some concerns raised around direct map
> > efficiency, but in case you are going to rework this...how does this
> > memory work for the existing kernel functionality that does things like
> > this?
> >
> > get_user_pages(, &page);
> > ptr = kmap(page);
> > foo = *ptr;
> >
> > Not sure if I'm missing something, but I think apps could cause the
> > kernel to access a not-present page and oops.
>
> The idea is that this memory should not be accessible by the kernel, so
> the sequence you describe should indeed fail.
>
> Probably oops would be to noisy and in this case the report needs to be
> less verbose.

I was more concerned that it could cause kernel instabilities.

I see, so it should not be accessed even at the userspace address? I
wonder if it should be prevented somehow then. At least
get_user_pages() should be prevented I think. Blocking copy_*_user()
access might not be simple.

I'm also not so sure that a user would never have any reason to copy
data from this memory into the kernel, even if it's just for
convenience. In that case a user setup could break if a specific kernel
implementation switched from copy_*_user() to get_user_pages()/kmap().
So this seems a bit thorny without either fully blocking access from
the kernel or deprecating that pattern. You should probably call out
these "no passing data to/from the kernel" expectations, unless I
missed them somewhere.
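
To make the "user passes this memory to the kernel" case concrete, here
is a minimal userspace sketch. It is only an illustration under stated
assumptions: the syscall number 447 and the flags value 0 are not taken
from this series, and secret_pipe_roundtrip() is a hypothetical helper
name. It maps one page from a memfd_secret() descriptor and hands the
userspace address to write() on a pipe, so the kernel has to read from
the secret mapping to complete the call.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: syscall number is not taken from this series. */
#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447
#endif

/*
 * Hypothetical helper. Returns:
 *   0  data written from the secret mapping came back intact via a pipe
 *   1  a secret mapping could not be set up on this system
 *  -1  the kernel refused the copy, or the data did not match
 */
int secret_pipe_roundtrip(void)
{
    static const char msg[] = "secret";
    char out[sizeof(msg)] = { 0 };
    long page = sysconf(_SC_PAGESIZE);
    int pfd[2];
    int ret = -1;

    /* Assumption: flags == 0 selects some default protection mode. */
    int fd = (int)syscall(SYS_memfd_secret, 0);
    if (fd < 0)
        return 1;
    if (page <= 0 || ftruncate(fd, (off_t)page) != 0) {
        close(fd);
        return 1;
    }

    char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return 1;
    }

    /* Plain userspace store into the secret area. */
    memcpy(p, msg, sizeof(msg));

    if (pipe(pfd) == 0) {
        /* write() asks the kernel to read from the secret mapping. */
        if (write(pfd[1], p, sizeof(msg)) == (ssize_t)sizeof(msg) &&
            read(pfd[0], out, sizeof(out)) == (ssize_t)sizeof(out) &&
            memcmp(out, msg, sizeof(msg)) == 0)
            ret = 0;
        close(pfd[0]);
        close(pfd[1]);
    }

    munmap(p, (size_t)page);
    close(fd);
    return ret;
}
```

Whether this returns 0 or -1 depends on how the kernel side of write()
reaches the user buffer, which is exactly the behavior I think needs to
be spelled out.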