Hello Tetsuo, On Sat, Aug 08, 2020 at 10:01:21AM +0900, Tetsuo Handa wrote: > use of killable waits disables ability to detect possibility of deadlock (because > lockdep can't check possibility of deadlock which involves actions in userspace), for > syzkaller process is SIGKILLed after 5 seconds while khungtaskd's timeout is 140 seconds. > > If we encounter a deadlock in an unattended operation (e.g. some server process), > we don't have a method for resolving the deadlock. Therefore, I consider that > t->state == TASK_UNINTERRUPTIBLE check is a bad choice. Unless a sleep is neutral > (e.g. no lock is held, or obviously safe to sleep with that specific lock held), > sleeping for 140 seconds inside the kernel is a bad sign even if interruptible/killable. Task in killable state for seconds as result of another task taking too long to do something in kernel sounds bad, if the other task had a legitimate reason to take a long time in normal operations, i.e. like if the other task was just doing an getdents of a large directory. Nobody force any app to use userfaultfd, if an app uses it and the other side of the pipe trusts to read from it, and it gets stuck for seconds in uninterruptible and killable state, it's either an app bug resolvable with kill -9. We also can't enforce all signals to run in presence of other bugs, for example if the task that won't respond to any signal other than CONT and KILL was blocked in stopped state by a buggy SIGSTOP. The pipe also can get stuck if the network is down and it's swapping in from NFS and nobody is forced to take the risk of using network attached storage as swap device either. The hangcheck is currently correct to report a concern, because the other side of the pipe may be another process of another user that cannot SIGKILL the task blocked in the userfault. That sounds far fetched and it's not particular concerning anyway, but it's not technically impossible so I agree with the hangcheck timer reporting an issue that needs correction. However once the mutex is killable there's no concern anymore and the hangcheck timer is correct also not reporting any misbehavior anymore. Instead of userfaultfd, you can think at 100% kernel faults backed by swapin from NFS or swaping from attached network storage or swapin from scsi with a scsi fibre channel accidentally pulled out of a few seconds. It's nice if uffd can survive as well as nfs or scsi would by retrying and waiting more than 1sec. > Can we do something like this? > > bool retried = false; > retry: > lock(); > disable_fault(); > ret = access_memory_that_might_fault(); > enable_fault(); > if (ret == -EWOULDFAULT && !retried) > goto retry_without_lock; > if (ret == 0) > ret = do_something(); > unlock(); > return ret; > retry_without_lock: > unlock(); > ret = access_memory_that_might_fault(); > retried = true; > goto retry; This would work, but it'll make the kernel more complex than using a killable mutex. It'd also give a worse runtime than the killable mutex, if the only source of blocking events while holding the mutex wouldn't be the page fault. With just 2 processes in this case probably it would be fine and there are likely won't be other sources of contention, so the main cons is just the code complexity to be maintained and the fact it won't provide any measurable practical benefit, if something it'll run slower by having to repeat the same fault in blocking and non blocking mode. With regard to the reporting of the hangcheck timer most modern paging code uses killable mutex because unlike the pipe code, there can be other sources of blockage and you don't want to wait for shared resources to unblock a process that is waiting on a mutex. I think trying to reduce the usage of killable mutex overall is a ship that has sailed, it won't move the needle to just avoid it in pipe code since it'll remain everywhere else. So I'm certainly not against your proposal, but if we increase the complexity like above then I'd find it more attractive if it was for some other benefit unrelated to userfaultfd, or swapin from NFS or network attached storage for that matter, and I don't see a big enough benefit to justify it. Thanks! Andrea PS. I'll be busy until Wed sorry if I don't answer promptly to followups. If somebody could give a try to add the killable mutex bailout failure paths that return to userland direct, or your more complex alternative it'd be great.