On Wed, Apr 4, 2018 at 4:39 PM, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > On Wed, Apr 04, 2018 at 11:32:54AM +0200, Daniel Vetter wrote: >> So we've done some experiments for the case where the fault originated >> from kernel context (copy_to|from_user and friends). The fixup code seems >> to retry the copy once after the fault (in copy_user_handle_tail), if that >> fails again we get a short read/write. This might result in an -EFAULT, >> short read()/write() or anything else really, depending upon the syscall >> api. >> >> Except in some code paths in gpu drivers where we convert anything into >> -ERESTARTSYS/EINTR if there's a signal pending it won't ever result in the >> syscall getting restarted (well except maybe short read/writes if >> userspace bothers with that). >> >> So I guess gpu fault handlers indeed break the kernel's expectations, but >> then I think we're getting away with that because the inner workings of >> gpu memory objects is all heavily abstracted away by opengl/vulkan and >> friends. >> >> I guess what we could do is try to only do killable sleeps if it's a >> kernel fault, but that means wiring a flag through all the callchains. Not >> pretty. Except when there's a magic set of functions that would convert >> all interruptible sleeps to killable ones only for us. > > I actually have plans to allow mutex_lock_{interruptible,killable} to > return -EWOULDBLOCK if a flag is set. So this doesn't seem entirely > unrelated. Something like this perhaps: > > struct task_struct { > + unsigned int sleep_state; > }; > > static noinline int __sched > -__mutex_lock_interruptible_slowpath(struct mutex *lock) > +__mutex_lock_slowpath(struct mutex *lock, long state) > { > - return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_); > + if (state == TASK_NOBLOCK) > + return -EWOULDBLOCK; > + return __mutex_lock(lock, state, 0, NULL, _RET_IP_); > } > > +int __sched mutex_lock_state(struct mutex *lock, long state) > +{ > + might_sleep(); > + > + if (__mutex_trylock_fast(lock)) > + return 0; > + > + return __mutex_lock_slowpath(lock, state); > +} > +EXPORT_SYMBOL(mutex_lock_state); > > Then the page fault handler can do something like: > > old_state = current->sleep_state; > current->sleep_state = TASK_INTERRUPTIBLE; > ... > current->sleep_state = old_state; > > > This has the page-fault-in-a-signal-handler problem. I don't know if > there's a way to determine if we're already in a signal handler and use > a different sleep_state ...? Not sure what problem you're trying to solve, but I don't think that's the one we have. The only way what we do goes wrong is if the fault originates from kernel context. For faults from the signal handler I think you just get to keep the pieces. Faults form kernel we can detect through FAULT_FLAG_USER. The issue I'm seeing is the following: 1. Some kernel code does copy_*_user, and it points at a gpu mmap region. 2. We fault and go into the gpu driver fault handler. That refuses to insert the pte because a signal is pending (because of all the interruptible waits and locks). 3. Fixup section runs, which afaict tries to do the copy once more using copy_user_handle_tail. 4. We fault again, because the pte is still not present. 5. GPU driver is still refusing to install the pte because signals are pending. 6. Fixup section for copy_user_handle_tail just bails out. 7. copy_*_user returns and indicates that that not all bytes have been copied. 8. syscall (or whatever it is) bails out and returns to userspace, most likely with -EFAULT (but this ofc depends upon the syscall and what it should do when userspace access faults. 9. Signal finally gets handled, but the syscall already failed, and no one will restart it. If userspace is prudent, it might fail (or maybe hit an assert or something). Or maybe I'm confused by your diff, since nothing seems to use current->sleep_state. The problem is also that it's any sleep we do (they all tend to be interruptible, at least when waiting for the gpu or taking any locks that might be held while waiting for the gpu, or anything else that might be blocked waiting for the gpu really). So only patching mutex_lock won't fix this. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch