On 3/26/21 9:11 AM, Stefan Metzmacher wrote:
> On 26.03.21 at 16:10, Jens Axboe wrote:
>> On 3/26/21 9:08 AM, Stefan Metzmacher wrote:
>>> On 26.03.21 at 15:55, Jens Axboe wrote:
>>>> On 3/26/21 8:53 AM, Jens Axboe wrote:
>>>>> On 3/26/21 8:45 AM, Stefan Metzmacher wrote:
>>>>>> On 26.03.21 at 15:43, Stefan Metzmacher wrote:
>>>>>>> On 26.03.21 at 15:38, Jens Axboe wrote:
>>>>>>>> On 3/26/21 7:59 AM, Jens Axboe wrote:
>>>>>>>>> On 3/26/21 7:54 AM, Jens Axboe wrote:
>>>>>>>>>>> The KILL after STOP deadlock still exists.
>>>>>>>>>>
>>>>>>>>>> In which tree? Sounds like you're still on the old one with that
>>>>>>>>>> incremental you sent, which wasn't complete.
>>>>>>>>>>
>>>>>>>>>>> Does io_wq_manager() exit without cleaning up on SIGKILL?
>>>>>>>>>>
>>>>>>>>>> No, it should clean up in all cases. I'll try your stop + kill, I just
>>>>>>>>>> tested both of them separately and didn't observe anything. I also ran
>>>>>>>>>> your io_uring-cp example (and found a bug in the example, fixed and
>>>>>>>>>> pushed), fwiw.
>>>>>>>>>
>>>>>>>>> I can reproduce this one! I'll take a closer look.
>>>>>>>>
>>>>>>>> OK, that one is actually pretty straightforward - we rely on cleaning
>>>>>>>> up on exit, but for fatal cases, get_signal() will call do_exit() for us
>>>>>>>> and never return. So we might need a special case in there to deal with
>>>>>>>> that, or some other way of ensuring that fatal signal gets processed
>>>>>>>> correctly for IO threads.
>>>>>>>
>>>>>>> And if (fatal_signal_pending(current)) doesn't prevent get_signal() from being called?
>>>>>>
>>>>>> Ah, we're still in the first get_signal() from SIGSTOP, correct?
>>>>>
>>>>> Yes exactly, we're waiting in there being stopped. So we either need to
>>>>> check for something ala:
>>>>>
>>>>> relock:
>>>>> +	if (current->flags & PF_IO_WORKER && fatal_signal_pending(current))
>>>>> +		return false;
>>>>>
>>>>> to catch it upfront and from the relock case, or add:
>>>>>
>>>>> fatal:
>>>>> +	if (current->flags & PF_IO_WORKER)
>>>>> +		return false;
>>>>>
>>>>> to catch it in the fatal section.
>>>>
>>>> Can you try this? Not crazy about adding a special case, but I don't
>>>> think there's any way around this one. And should be pretty cheap, as
>>>> we're already pulling in ->flags right above anyway.
>>>>
>>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>>> index 5ad8566534e7..5b75fbe3d2d6 100644
>>>> --- a/kernel/signal.c
>>>> +++ b/kernel/signal.c
>>>> @@ -2752,6 +2752,15 @@ bool get_signal(struct ksignal *ksig)
>>>>  		 */
>>>>  		current->flags |= PF_SIGNALED;
>>>>
>>>> +		/*
>>>> +		 * PF_IO_WORKER threads will catch and exit on fatal signals
>>>> +		 * themselves. They have cleanup that must be performed, so
>>>> +		 * we cannot call do_exit() on their behalf. coredumps also
>>>> +		 * do not apply to them.
>>>> +		 */
>>>> +		if (current->flags & PF_IO_WORKER)
>>>> +			return false;
>>>> +
>>>>  		if (sig_kernel_coredump(signr)) {
>>>>  			if (print_fatal_signals)
>>>>  				print_fatal_signal(ksig->info.si_signo);
>>>>
>>>
>>> I guess not before next week, but if it resolves the problem for you,
>>> I guess it would be good to get this into rc5.
>>
>> It does, I pushed out a new branch. I'll send out a v2 series in a bit.
>
> Great, thanks!
>
> Any chance to get the "cmdline" hiding included?

I'll take a look at your response there, haven't yet. Wanted to get this
one sorted first.

-- 
Jens Axboe
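
For readers following the thread, the control flow the patch is aiming for can be modelled in plain userspace C. This is only a rough sketch, not kernel code: the `model_*` names, the flag values, and the `model_task` struct are all hypothetical stand-ins invented for illustration, and the real do_exit() path never returns, which the model can only represent with a flag.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel task flags; the values are illustrative. */
#define MODEL_PF_IO_WORKER 0x1u
#define MODEL_PF_SIGNALED  0x2u

/* Hypothetical, heavily reduced model of a task. */
struct model_task {
	unsigned int flags;
	bool fatal_pending;        /* models fatal_signal_pending(current) */
	bool exited_in_get_signal; /* models get_signal() exiting for us */
	bool cleaned_up;           /* models the worker's own exit cleanup */
};

/*
 * Models only the fatal section of get_signal() with the proposed check.
 * Returns false in every case, because this model has no handlers to run.
 */
static bool model_get_signal(struct model_task *t)
{
	if (!t->fatal_pending)
		return false;

	t->flags |= MODEL_PF_SIGNALED;

	/* Proposed special case: IO workers exit on fatal signals themselves. */
	if (t->flags & MODEL_PF_IO_WORKER)
		return false;

	/* Everyone else: the signal code exits on the task's behalf. */
	t->exited_in_get_signal = true;
	return false;
}

/* Models one pass of a worker loop that checks for signals, then exits. */
static void model_io_worker_step(struct model_task *t)
{
	(void)model_get_signal(t);

	/* In the kernel, control would never come back here after do_exit(). */
	if (t->exited_in_get_signal)
		return;

	/* The worker notices the fatal signal and runs its own cleanup. */
	if (t->fatal_pending)
		t->cleaned_up = true;
}
```

With `MODEL_PF_IO_WORKER` set and a fatal signal pending, `model_io_worker_step()` reaches the cleanup branch; without the flag, the modelled exit happens inside `model_get_signal()` and the cleanup is skipped, which is the situation described above for the KILL after STOP case.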