On 04/30/2018 12:25 PM, Eric W. Biederman wrote: > Christian König <ckoenig.leichtzumerken at gmail.com> writes: > >> Hi Eric, >> >> sorry for the late response, was on vacation last week. >> >> Am 26.04.2018 um 02:01 schrieb Eric W. Biederman: >>> Andrey Grodzovsky <Andrey.Grodzovsky at amd.com> writes: >>> >>>> On 04/25/2018 01:17 PM, Oleg Nesterov wrote: >>>>> On 04/25, Andrey Grodzovsky wrote: >>>>>> here (drm_sched_entity_fini) is also a bad idea, but we still want to be >>>>>> able to exit immediately >>>>>> and not wait for GPU jobs completion when the reason for reaching this code >>>>>> is because of KILL >>>>>> signal to the user process who opened the device file. >>>>> Can you hook f_op->flush method? >> THANKS! That sounds like a really good idea to me and we haven't investigated >> into that direction yet. > For the backwards compatibility concerns you cite below the flush method > seems a much better place to introduce the wait. You at least really > will be in a process context for that. Still might be in exit but at > least you will be legitimately be in a process. > >>>> But this one is called for each task releasing a reference to the the file, so >>>> not sure I see how this solves the problem. >>> The big question is why do you need to wait during the final closing a >>> file? >> As always it's because of historical reasons. Initially user space pushed >> commands directly to a hardware queue and when a processes finished we didn't >> need to wait for anything. >> >> Then the GPU scheduler was introduced which delayed pushing the jobs to the >> hardware queue to a later point in time. >> >> This wait was then added to maintain backward compability and not break >> userspace (but see below). > That make sense. > >>> The wait can be terminated so the wait does not appear to be simply a >>> matter of correctness. >> Well when the process is killed we don't care about correctness any more, we >> just want to get rid of it as quickly as possible (OOM situation etc...). >> >> But it is perfectly possible that a process submits some render commands and >> then calls exit() or terminates because of a SIGTERM, SIGINT etc.. In this case >> we need to wait here to make sure that all rendering is pushed to the hardware >> because the scheduler might need resources/settings from the file >> descriptor. >> >> For example if you just remove that wait you could close firefox and get garbage >> on the screen for a millisecond because the remaining rendering commands where >> not executed. >> >> So what we essentially need is to distinct between a SIGKILL (which means stop >> processing as soon as possible) and any other reason because then we don't want >> to annoy the user with garbage on the screen (even if it's just for a few >> milliseconds). > I see a couple of issues. > > - Running the code in release rather than in flush. > > Using flush will catch every close so it should be more backwards > compatible. f_op->flush always runs in process context so looking at > current makes sense. > > - Distinguishing between death by SIGKILL and other process exit deaths. > > In f_op->flush the code can test "((tsk->flags & PF_EXITING) && > (tsk->code == SIGKILL))" to see if it was SIGKILL that terminated > the process. What about Oleg's note not to rely on tsk->code == SIGKILL (still not clear why ?) and that this entire check is racy (against what ?) ? Or is it relevant to .release hook only ? Andrey > > - Dealing with stuck queues (where this patchset came in). > > For stuck queues you are going to need a timeout instead of the current > indefinite wait after PF_EXITING is set. From what you have described a > few milliseconds should be enough. If PF_EXITING is not set you can > still just make the wait killable and skip the timeout if that will give > a better backwards compatible user experience. > > What can't be done is try and catch SIGKILL after a process has called > do_exit. A dead process is a dead process. > > Eric