On Wed, Apr 17, 2024 at 11:03:07AM -0500, Mike Christie wrote:
> On 4/16/24 10:50 PM, Jason Wang wrote:
> > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> >>
> >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@xxxxxxxxxx> wrote:
> >>>
> >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> >>>> <michael.christie@xxxxxxxxxx> wrote:
> >>>>>
> >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> >>>>>> <michael.christie@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> The following patches were made over Linus's tree and also apply over
> >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> >>>>>>> handle SIGKILL by flushing queued works, stopping new works from being
> >>>>>>> queued, and preparing the task for an early exit.
> >>>>>>>
> >>>>>>> This removes the need for the signal/coredump hacks added in:
> >>>>>>>
> >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> >>>>>>>
> >>>>>>> when the vhost_task patches were initially merged, and fixes the issue
> >>>>>>> reported in this thread:
> >>>>>>>
> >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@xxxxxxxxxx/
> >>>>>>>
> >>>>>>> Long Background:
> >>>>>>>
> >>>>>>> The original vhost worker code didn't support any signals. If the
> >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> >>>>>>> process would exit, dropping all references to the device, and then the
> >>>>>>> file operation's release function would be called. From there we would
> >>>>>>> wait on running IO and then clean up the device's memory.
> >>>>>>
> >>>>>> A dumb question.
> >>>>>>
> >>>>>> Is this a change noticeable by user space? For example, with this series
> >>>>>> a SIGKILL may shut down the datapath ...
> >>>>>
> >>>>> It already changed in 6.4. We basically added a new interface to shut down
> >>>>> everything (userspace and vhost kernel parts). So we won't just shut down
> >>>>> the data path while userspace is still running. We will shut down everything
> >>>>> now if you send a SIGKILL to a vhost worker's thread.
> >>>>
> >>>> If I understand correctly, for example Qemu can still live if SIGKILL
> >>>> is just sent to the vhost thread.
> >>>
> >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> >>> We used kthreads, which are special and can ignore it, like how userspace
> >>> can ignore SIGHUP.
> >>>
> >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> >>> ignores it like I described below, the signal is still delivered
> >>> to the other qemu threads due to the shared signal handler. Userspace
> >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> >>> kernel forces it to exit.
> >>
> >> Ok, I see, so the reason is that vhost belongs to the same thread
> >> group as the owner now.
> >>
> >>>
> >>>>
> >>>> If this is correct, guests may detect this (for example virtio-net has
> >>>> a watchdog).
> >>>>
> >>>
> >>> What did you mean by that part? Do you mean if the vhost thread were to
> >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> >>> the guest (virtio-net driver in the guest kernel) would detect that?
> >>
> >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> >>
> >>> Or are you saying the watchdog in the guest can detect signals that the
> >>> host gets?
> >>>
> >>>
> >>>>>
> >>>>> Here are a lot of details:
> >>>>>
> >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> >>>>> to a vhost worker, we ignored it. Nothing happened. kthreads are special and
> >>>>> can ignore all signals.
> >>>>>
> >>>>> You could think of it as the worker is a completely different process than
> >>>>> qemu/userspace, so they have completely different signal handlers. The
> >>>>> vhost worker signal handler ignores all signals, even SIGKILL.
> >>>>
> >>>> Yes.
> >>>>
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> >>>>> don't get to do an explicit close() on the vhost device and we don't get
> >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> >>>>> the vhost worker.
> >>>>
> >>>> Right.
> >>>>
> >>>>>
> >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> >>>>> can be thought of as a thread within the userspace process. With that
> >>>>> change we have the same signal handler as the userspace process.
> >>>>>
> >>>>> If you send a SIGKILL to a qemu thread, then it works like above.
> >>>>>
> >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> >>>>> then just continue to process works until file_operations->release
> >>>>> calls
> >>>>
> >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> >>>
> >>> Not exactly. The vhost_task stays alive temporarily.
> >>>
> >>> The signal is still delivered to the userspace threads, and they will
> >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> >>> the process, and all userspace threads exit like normal because the vhost
> >>> task and normal old userspace threads share a signal handler. When
> >>> userspace exits, the kernel force drops the refcounts on the vhost
> >>> devices and that runs the release function, so the vhost_task will then exit.
> >>>
> >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> >>
> >> Yes. To tell the truth, it looks even worse, but it might be too late to fix.
> >
> > Andres (cced) has identified two other possible changes:
> >
> > 1) doesn't run in the global PID namespace but runs in the namespace of the owner
>
> Yeah, I mentioned that one in vhost.h like it's a feature, and when posting
> the patches I mentioned it as a possible fix. I mean I thought we wanted it
> to work like qemu and iothreads where the iothread would inherit all those
> values automatically.
>
> At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> because there was no kernel function for it (like how we didn't inherit v2
> cgroups until recently when someone added some code for that).
>
> I don't know if it's allowed to have something like qemu in namespace N but then
> have its children (the vhost thread in this case) in the global namespace. I'll
> look into it.

Yeah, a big if.

> > 2) doesn't inherit kthreadd's scheduling attributes but the owner's
>
> Same as above for this one. I thought I was fixing a bug where before
> we had to manually tune the vhost thread's values, but for iothreads they
> automatically got set up.
>
> Just to clarify this one. When we used kthreads, kthread() resets the
> scheduler priority for the kthread that's created, so we got the default
> values instead of inheriting kthreadd's values. So we would want:
>
> + sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
>
> in vhost_task_fn() instead of inheriting kthreadd's values.
>
> >
> > Though such a change makes more sense for some use cases, it may break others.
> >
> > I wonder if we need to introduce a new flag and bring back the old kthread
>
> Do you mean something like a module param?
>
> > code if the flag is not set? Then we would not end up trying to align
> > the behaviour?
> >
>
> Let me know what you guys prefer. The sched part is easy. The namespace
> part might be more difficult, but I will look into it if you want it.
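
To make the behaviour discussed above concrete, here is a rough sketch of the
kind of worker loop being described: the vhost task sees a pending SIGKILL,
dequeues it, and keeps draining queued works until the file_operations->release
path tells it to stop. This is only an illustrative sketch, not the actual
kernel/vhost_task.c source; the exact stop flag, the bool-returning vtsk->fn,
and the get_signal() handling shown here are assumptions.

  static int vhost_task_fn(void *data)
  {
          struct vhost_task *vtsk = data;

          for (;;) {
                  bool did_work;

                  if (signal_pending(current)) {
                          struct ksignal ksig;

                          /*
                           * Dequeue the signal. A fatal signal does not make
                           * the worker exit here on its own; the userspace
                           * threads sharing the signal handler are the ones
                           * that die and then drop their device references.
                           */
                          get_signal(&ksig);
                  }

                  /* Assumed stop bit, set by vhost_task_stop() when
                   * file_operations->release tears the device down. */
                  set_current_state(TASK_INTERRUPTIBLE);
                  if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
                          __set_current_state(TASK_RUNNING);
                          break;
                  }

                  did_work = vtsk->fn(vtsk->data);  /* process queued works */
                  if (!did_work)
                          schedule();
          }

          complete(&vtsk->exited);
          do_exit(0);
  }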
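
On the scheduler attributes, a minimal sketch of the reset being suggested at
the top of vhost_task_fn(), mirroring what kthread() already does for kthreads
before it runs the threadfn (assuming the defaults, rather than the owner's
attributes, are what we want; untested):

  static int vhost_task_fn(void *data)
  {
          static const struct sched_param param = { .sched_priority = 0 };

          /*
           * Like kthread(): reset policy/priority to the defaults instead of
           * keeping whatever the owner (e.g. a VMM running with RT priority)
           * was using when it created the worker.
           */
          sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);

          /* ... the rest of vhost_task_fn() as sketched above ... */
  }

Whether resetting or inheriting should be the default is exactly the open
question above about a flag/module parameter, so treat this as one of the two
options rather than the answer.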