On Mon, Oct 25, 2021 at 09:00:43PM +0200, Andreas Gruenbacher wrote: > On Fri, Oct 22, 2021 at 9:23 PM Linus Torvalds > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Fri, Oct 22, 2021 at 8:06 AM Catalin Marinas <catalin.marinas@xxxxxxx> wrote: > > > Probing only the first byte(s) in fault_in() would be ideal, no need to > > > go through all filesystems and try to change the uaccess/probing order. > > > > Let's try that. Or rather: probing just the first page - since there > > are users like that btrfs ioctl, and the direct-io path. > > For direct I/O, we actually only want to trigger page fault-in so that > we can grab page references with bio_iov_iter_get_pages. Probing for > sub-page error domains will only slow things down. If we hit -EFAULT > during the actual copy-in or copy-out, we know that the error can't be > page fault related. Similarly, in the buffered I/O case, we only > really care about the next byte, so any probing beyond that is > unnecessary. > > So maybe we should split the sub-page error domain probing off from > the fault-in functions. Or at least add an argument to the fault-in > functions that specifies the amount of memory to probe. My preferred option is not to touch fault-in for sub-page faults (though I have some draft patches, they need testing). All this fault-in and uaccess with pagefaults_disabled() is needed to avoid a deadlock when the uaccess fault handling would take the same lock. With sub-page faults, the kernel cannot fix it up anyway, so the arch code won't even attempt call handle_mm_fault() (it is not an mm fault). But the problem is the copy_*_user() etc. API that can only return the number of bytes not copied. That's what I think should be fixed. fault_in() feels like the wrong place to address this when it's not an mm fault. As for fault_in() getting another argument with the amount of sub-page probing to do, I think the API gets even more confusing. I was also thinking, with your patches for fault_in() now returning size_t, is the expectation to be precise in what cannot be copied? We don't have such requirement for copy_*_user(). While more intrusive, I'd rather change copy_page_from_iter_atomic() etc. to take a pointer where to write back an error code. If it's -EFAULT, retry the loop. If it's -EACCES/EPERM just bail out. Or maybe simply a bool set if there was an mm fault to be retried. Yet another option to return an -EAGAIN if it could not process the mm fault due to page faults being disabled. Happy to give this a try, unless there's a strong preference for the fault_in() fix-up (well, I can do both options and post them). -- Catalin