Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting

michael.christie@xxxxxxxxxx · Fri, 12 Apr 2024 11:52:48 -0500

On 4/11/24 10:28 PM, Jason Wang wrote:
> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> <michael.christie@xxxxxxxxxx> wrote:
>>
>> On 4/11/24 3:39 AM, Jason Wang wrote:
>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
>>> <michael.christie@xxxxxxxxxx> wrote:
>>>>
>>>> The following patches were made over Linus's tree and also apply over
>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
>>>> handle SIGKILL by flushing queued works, stop new works from being
>>>> queued, and prepare the task for an early exit.
>>>>
>>>> This removes the need for the signal/coredump hacks added in:
>>>>
>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
>>>>
>>>> when the vhost_task patches were initially merged and fix the issue
>>>> in this thread:
>>>>
>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@xxxxxxxxxx/
>>>>
>>>> Long Background:
>>>>
>>>> The original vhost worker code didn't support any signals. If the
>>>> userspace application that owned the worker got a SIGKILL, the app/
>>>> process would exit dropping all references to the device and then the
>>>> file operation's release function would be called. From there we would
>>>> wait on running IO then cleanup the device's memory.
>>>
>>> A dumb question.
>>>
>>> Is this a user space noticeable change? For example, with this series
>>> a SIGKILL may shutdown the datapath ...
>>
>> It already changed in 6.4. We basically added a new interface to shutdown
>> everything (userspace and vhost kernel parts). So we won't just shutdown
>> the data path while userspace is still running. We will shutdown everything
>> now if you send a SIGKILL to a vhost worker's thread.
> 
> If I understand correctly, for example Qemu can still live is SIGKILL
> is just send to vhost thread.

Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
We used kthreads which are special and can ignore it like how userspace
can ignore SIGHUP.

6.4 and newer kernels cannot survive. Even if the vhost thread sort of
ignores it like I described below where, the signal is still delivered
to the other qemu threads due to the shared signal handler. Userspace
can't ignore SIGKILL. It doesn't have any say in the matter, and the
kernel forces them to exit.

> 
> If this is correct, guests may detect this (for example virtio-net has
> a watchdog).
> 

What did you mean by that part? Do you mean if the vhost thread were to
exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
the guest (virtio-net driver in the guest kernel) would detect that? Or
are you saying the watchdog in the guest can detect signals that the
host gets?

>>
>> Here are a lots of details:
>>
>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
>> can ignore all signals.
>>
>> You could think of it as the worker is a completely different process than
>> qemu/userspace so they have completely different signal handlers. The
>> vhost worker signal handler ignores all signals even SIGKILL.
> 
> Yes.
> 
>>
>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
>> don't get to do an explicit close() on the vhost device and we don't get
>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
>> code runs and releases refcounts on the device/file, then the vhost device's
>> file_operations->release function is called. vhost_dev_cleanup then stops
>> the vhost worker.
> 
> Right.
> 
>>
>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
>> can be thought of as a thread within the userspace process. With that
>> change we have the same signal handler as the userspace process.
>>
>> If you send a SIGKILL to a qemu thread then it works like above.
>>
>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
>> ignores it (that is the hack that I mentioned at the beginning of this
>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
>> then just continue to process works until file_operations->release
>> calls
> 
> Yes, so this sticks to the behaviour before vhost_tasks.

Not exactly. The vhost_task stays alive temporarily.

The signal is still delivered to the userspace threads and they will
exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
the process and all userspace threads exit like normal because the vhost
task and normal old userspace threads share a signal handler. When
userspace exits, the kernel force drops the refcounts on the vhost
devices and that runs the release function so the vhost_task will then exit.

So what I'm trying to say is that in 6.4 we already changed the behavior.

> 
>>
>> However, the change in behavior is that because the worker is just a
>> thread within qemu, qemu is going to exit since they share a signal
>> handler and userspace can't ignore SIGKILL.
>> We then run perform the
>> steps above like in the pre-6.4 kernel description as if you sent a
>> SIGKILL directly to a userspace thread.
>>
>> - With the patches in this thread there is no major difference in behavior
>> with 6.4 and newer kernels. We might exit a little faster. Instead of the
>> vhost thread trying to do it's hacked up version of ignoring the signal
>> and waiting for userspace to exit and call file_operations->release, the
>> vhost worker thread will exit after it flushes works and stops new ones.
>> So now, you can have the vhost thread exiting in parallel with the
>> userspace thread.
> 
> Everything would be fine if Qemu wanted to quit but I meant there
> could be a case where SIGKILL was sent to the vhost thread but not
> Qemu.

Yeah, in the last mail, I was trying to say that behavior already
changed in 6.4. In 6.4 we basically added a new interface to the vhost
layer that uses signals/SIGKILL to tell the vhost layer and userspace to
completely shutdown.

This patchset is just removing the hacks Eric and Oleg allowed for us in
6.4. That code where we do our hacked up version of ignoring signals
(clear it and just go on) combined with the signals/coredump related
hacks done in this patch:

commit f9010dbdce911ee1f1af1398a24b1f9f992e0080
Author: Mike Christie <michael.christie@xxxxxxxxxx>
Date:   Thu Jun 1 13:32:32 2023 -0500

    fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

is causing this bug:

https://lore.kernel.org/all/000000000000a41b82060e875721@xxxxxxxxxx/

Basically, as Eric mentions in that thread the signal/kernel developers
considered the signal/coredump hacks were meant to be temp, and we were
supposed to convert to work more like io_uring's io_worker_exit. That's
what I'm doing in this patchset. Patch 9/9 in this set then removes the
hacks from the core kernel/signal code.