On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
>
> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@xxxxxxxxxx> wrote:
> >
> > On 4/11/24 10:28 PM, Jason Wang wrote:
> > > On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > <michael.christie@xxxxxxxxxx> wrote:
> > >>
> > >> On 4/11/24 3:39 AM, Jason Wang wrote:
> > >>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > >>> <michael.christie@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> The following patches were made over Linus's tree and also apply over
> > >>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > >>>> handle SIGKILL by flushing queued works, stopping new works from being
> > >>>> queued, and preparing the task for an early exit.
> > >>>>
> > >>>> This removes the need for the signal/coredump hacks added in:
> > >>>>
> > >>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > >>>>
> > >>>> when the vhost_task patches were initially merged, and fixes the issue
> > >>>> in this thread:
> > >>>>
> > >>>> https://lore.kernel.org/all/000000000000a41b82060e875721@xxxxxxxxxx/
> > >>>>
> > >>>> Long Background:
> > >>>>
> > >>>> The original vhost worker code didn't support any signals. If the
> > >>>> userspace application that owned the worker got a SIGKILL, the app/
> > >>>> process would exit, dropping all references to the device, and then the
> > >>>> file operation's release function would be called. From there we would
> > >>>> wait on running IO and then clean up the device's memory.
> > >>>
> > >>> A dumb question.
> > >>>
> > >>> Is this a userspace-noticeable change? For example, with this series
> > >>> a SIGKILL may shut down the datapath ...
> > >>
> > >> It already changed in 6.4. We basically added a new interface to shut down
> > >> everything (userspace and vhost kernel parts). So we won't just shut down
> > >> the data path while userspace is still running. We will shut down everything
> > >> now if you send a SIGKILL to a vhost worker's thread.
> > >
> > > If I understand correctly, for example Qemu can still live if SIGKILL
> > > is just sent to the vhost thread.
> >
> > Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > We used kthreads, which are special and can ignore it, like how userspace
> > can ignore SIGHUP.
> >
> > On 6.4 and newer kernels it cannot survive. Even if the vhost thread sort of
> > ignores it like I described below, the signal is still delivered
> > to the other qemu threads due to the shared signal handler. Userspace
> > can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > kernel forces it to exit.
>
> Ok, I see, so the reason is that vhost belongs to the same thread
> group as the owner now.
>
> > >
> > > If this is correct, guests may detect this (for example virtio-net has
> > > a watchdog).
> >
> > What did you mean by that part? Do you mean if the vhost thread were to
> > exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > the guest (virtio-net driver in the guest kernel) would detect that?
>
> I meant this one. But since we are using CLONE_THREAD, we won't see these.
>
> > Or are you saying the watchdog in the guest can detect signals that the
> > host gets?
> >
> > >>
> > >> Here are a lot of details:
> > >>
> > >> - Pre-6.4 kernels, when vhost workers used kthreads: if you sent any signal
> > >> to a vhost worker, we ignored it. Nothing happened. kthreads are special and
> > >> can ignore all signals.
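To make the kthread point above concrete: by default a kernel thread
drops every signal, SIGKILL included, and only ever sees one if it
explicitly opts in with allow_signal(). A minimal sketch of that opt-in,
using a made-up worker function rather than the real vhost code:

    #include <linux/kthread.h>
    #include <linux/sched/signal.h>

    /* Hypothetical worker, not the vhost one: kthreads start with all
     * signals ignored, so even SIGKILL is dropped unless the thread
     * opts in via allow_signal(). */
    static int worker_fn(void *data)
    {
            allow_signal(SIGKILL);  /* without this, SIGKILL never shows up */

            while (!kthread_should_stop()) {
                    /* ... process queued works ... */
                    if (signal_pending(current))
                            break;  /* reachable only due to allow_signal() */
                    set_current_state(TASK_INTERRUPTIBLE);
                    schedule();
            }
            __set_current_state(TASK_RUNNING);
            return 0;
    }

The pre-6.4 vhost worker never made such an opt-in, which is why a
SIGKILL aimed at it was simply swallowed.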
> > >>
> > >> You could think of it as the worker being a completely different process
> > >> from qemu/userspace, so they have completely different signal handlers. The
> > >> vhost worker's signal handler ignores all signals, even SIGKILL.
> > >
> > > Yes.
> > >
> > >>
> > >> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > >> don't get to do an explicit close() on the vhost device and we don't get
> > >> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > >> code runs and releases refcounts on the device/file, then the vhost device's
> > >> file_operations->release function is called. vhost_dev_cleanup then stops
> > >> the vhost worker.
> > >
> > > Right.
> > >
> > >>
> > >> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > >> can be thought of as a thread within the userspace process. With that
> > >> change we have the same signal handler as the userspace process.
> > >>
> > >> If you send a SIGKILL to a qemu thread, then it works like above.
> > >>
> > >> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > >> ignores it (that is the hack that I mentioned at the beginning of this
> > >> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > >> then just continue to process works until file_operations->release
> > >> calls vhost_task_stop().
> > >
> > > Yes, so this sticks to the behaviour before vhost_tasks.
> >
> > Not exactly. The vhost_task stays alive temporarily.
> >
> > The signal is still delivered to the userspace threads, and they will
> > exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > the process, and all userspace threads exit like normal because the vhost
> > task and the normal old userspace threads share a signal handler. When
> > userspace exits, the kernel force-drops the refcounts on the vhost
> > devices, and that runs the release function, so the vhost_task will then
> > exit.
> >
> > So what I'm trying to say is that in 6.4 we already changed the behavior.

Yes. To tell the truth, it looks even worse, but it might be too late to
fix. Andres (cced) has identified two other possible changes:

1) the worker doesn't run in the global PID namespace but in the
   namespace of its owner
2) the worker doesn't inherit kthreadd's scheduling attributes but the
   owner's

Though such a change makes more sense for some use cases, it may break
others. I wonder if we need to introduce a new flag and bring back the
old kthread code when the flag is not set? Then we would not end up
trying to align the behaviour.

Thanks
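The thread-group behavior discussed above (a SIGKILL aimed at any one
thread takes the whole process down) can be seen from userspace with a
few lines of C. A self-contained sketch, where the "worker" is a plain
pthread standing in for a vhost_task; nothing below is qemu or vhost
code:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/types.h>

    static pid_t worker_tid;    /* tid of the stand-in worker thread */

    static void *worker(void *arg)
    {
            worker_tid = syscall(SYS_gettid);
            pause();    /* idle, like a worker waiting for queued works */
            return NULL;
    }

    int main(void)
    {
            pthread_t t;

            pthread_create(&t, NULL, worker, NULL);
            sleep(1);   /* crude sync so worker_tid is populated */

            /* Aim SIGKILL at the worker thread alone. SIGKILL always
             * acts on the whole thread group, so this main thread dies
             * too and the printf below never runs. */
            syscall(SYS_tgkill, getpid(), worker_tid, SIGKILL);

            sleep(5);
            printf("still alive\n");    /* never reached */
            return 0;
    }

Pre-6.4, the kthread-based worker sat in its own thread group, so the
same signal had no such process-wide effect; with CLONE_THREAD it
behaves as in the sketch.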