Re: [PATCH 0/9] vhost: Support SIGKILL by flushing and exiting

I'm not a kernel developer, so sorry if this is pointless spam.

But qemu is not the only thing consuming /dev/vhost-net:
https://doc.dpdk.org/guides/howto/virtio_user_as_exception_path.html

Before the series of patches around
https://github.com/torvalds/linux/commit/e297cd54b3f81d652456ae6cb93941fc6b5c6683,
you would run a DPDK application inside the namespaces of container
"blue", but the vhost-... threads were spawned as children of
kthreadd and ran in the global namespaces. That seemed
counterintuitive, and what's worse, the DPDK application running
inside the "blue" container namespaces could not see, and thus could
not modify, the scheduling attributes and affinity of its own
vhost-... threads. In scenarios with pretty strict isolation/pinning
requirements, the vhost-... threads could easily be scheduled on the
same CPUs as the DPDK poll mode drivers (because they inherit the
calling process' CPU affinity), and there was no way for the DPDK
process itself to move the vhost-... threads to other CPUs. Also, if
the DPDK process ran as SCHED_RR, the vhost-... threads would still
be SCHED_NORMAL.

After the patch series, the fact that the vhost-... threads run as
tasks of the process that requests them seems more natural and gives
us, as users, the kind of control we'd want from within the container
to modify the vhost-... threads' scheduling and affinity. The
vhost-... thread, as a child of the DPDK application, inherits the
same scheduling class, CPU set, etc., and the DPDK process can easily
change those attributes.
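
To make this concrete, here is a minimal sketch (untested, and assuming
the worker shows up under /proc/self/task with a comm name starting
with "vhost-", which is my understanding of the new behavior) of how a
DPDK process could re-pin and re-prioritize its own vhost-... worker
from inside the container:

#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Move every "vhost-..." sibling thread to one housekeeping CPU and
 * give it the same scheduling class as the rest of the process. */
static void retune_own_vhost_threads(int housekeeping_cpu)
{
        DIR *dir = opendir("/proc/self/task");
        struct dirent *entry;

        if (!dir)
                return;

        while ((entry = readdir(dir)) != NULL) {
                char path[64], comm[32] = "";
                pid_t tid = (pid_t)atoi(entry->d_name);
                FILE *f;

                if (tid <= 0)   /* skips "." and ".." */
                        continue;

                snprintf(path, sizeof(path),
                         "/proc/self/task/%d/comm", (int)tid);
                f = fopen(path, "r");
                if (!f)
                        continue;
                if (fgets(comm, sizeof(comm), f) &&
                    strncmp(comm, "vhost-", 6) == 0) {
                        cpu_set_t cpus;
                        struct sched_param sp = { .sched_priority = 1 };

                        /* keep the worker off the PMD cores ... */
                        CPU_ZERO(&cpus);
                        CPU_SET(housekeeping_cpu, &cpus);
                        sched_setaffinity(tid, sizeof(cpus), &cpus);

                        /* ... and match the process' scheduling class */
                        sched_setscheduler(tid, SCHED_RR, &sp);
                }
                fclose(f);
        }
        closedir(dir);
}

None of this was possible from inside the container with the old
kthread-based workers, because the vhost-... thread's PID wasn't even
visible in the container's PID namespace.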

However, if another user was used to the old behavior, and their
entire tooling was built around it (imagine someone wrote a bunch of
scripts/services to force affinity from the global PID namespace for
those vhost-... threads), then this change would break their tooling.
So because _I_ might fancy the new behavior while user _B_ might be
all set up for the old kthreadd behavior, shouldn't there be a flag
or configuration letting the user choose:

if (new_flag)
    vhost_task_create()
else
    kthread_create()

I don't know what kind of flag or knob I'd expect here; it could be
granular for the calling process (a new syscall?), a kernel flag,
etc. But something that lets the admin choose how the kernel spawns
the vhost-... threads?
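
Purely as an illustration of what I mean (everything below is made up
for the example: the parameter name and the two helper functions do
not exist in the kernel today), a vhost module parameter could gate
the two code paths, roughly mirroring the pseudocode above:

#include <linux/module.h>
#include <linux/moduleparam.h>
/* plus drivers/vhost/vhost.h for struct vhost_dev */

static bool inherit_owner = true;
module_param(inherit_owner, bool, 0444);
MODULE_PARM_DESC(inherit_owner,
                 "true: spawn vhost workers as threads of the owner (6.4+ behavior); "
                 "false: spawn them as kthreads under kthreadd (pre-6.4 behavior)");

static int vhost_worker_spawn(struct vhost_dev *dev)
{
        if (inherit_owner)
                /* shares namespaces, cgroups, affinity and sched class
                 * with the owner (hypothetical helper) */
                return vhost_worker_create_task(dev);

        /* old style: child of kthreadd, only visible in the global
         * namespaces (hypothetical helper) */
        return vhost_worker_create_kthread(dev);
}

A per-device backend feature bit (as suggested further down in this
thread) would probably fit the vhost model better than a global module
parameter, since it could be negotiated per owner; the module parameter
is just the simplest global, admin-controlled knob I could think of.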

On Thu, Apr 18, 2024 at 9:07 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>
> On Thu, Apr 18, 2024 at 12:08:52PM +0800, Jason Wang wrote:
> > On Thu, Apr 18, 2024 at 12:10 AM Mike Christie
> > <michael.christie@xxxxxxxxxx> wrote:
> > >
> > > On 4/16/24 10:50 PM, Jason Wang wrote:
> > > > On Mon, Apr 15, 2024 at 4:52 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > > >>
> > > >> On Sat, Apr 13, 2024 at 12:53 AM <michael.christie@xxxxxxxxxx> wrote:
> > > >>>
> > > >>> On 4/11/24 10:28 PM, Jason Wang wrote:
> > > >>>> On Fri, Apr 12, 2024 at 12:19 AM Mike Christie
> > > >>>> <michael.christie@xxxxxxxxxx> wrote:
> > > >>>>>
> > > >>>>> On 4/11/24 3:39 AM, Jason Wang wrote:
> > > >>>>>> On Sat, Mar 16, 2024 at 8:47 AM Mike Christie
> > > >>>>>> <michael.christie@xxxxxxxxxx> wrote:
> > > >>>>>>>
> > > >>>>>>> The following patches were made over Linus's tree and also apply over
> > > >>>>>>> mst's vhost branch. The patches add the ability for vhost_tasks to
> > > >>>>>>> handle SIGKILL by flushing queued works, stop new works from being
> > > >>>>>>> queued, and prepare the task for an early exit.
> > > >>>>>>>
> > > >>>>>>> This removes the need for the signal/coredump hacks added in:
> > > >>>>>>>
> > > >>>>>>> Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression")
> > > >>>>>>>
> > > >>>>>>> when the vhost_task patches were initially merged and fix the issue
> > > >>>>>>> in this thread:
> > > >>>>>>>
> > > >>>>>>> https://lore.kernel.org/all/000000000000a41b82060e875721@xxxxxxxxxx/
> > > >>>>>>>
> > > >>>>>>> Long Background:
> > > >>>>>>>
> > > >>>>>>> The original vhost worker code didn't support any signals. If the
> > > >>>>>>> userspace application that owned the worker got a SIGKILL, the app/
> > > >>>>>>> process would exit dropping all references to the device and then the
> > > >>>>>>> file operation's release function would be called. From there we would
> > > >>>>>>> wait on running IO then cleanup the device's memory.
> > > >>>>>>
> > > >>>>>> A dumb question.
> > > >>>>>>
> > > >>>>>> Is this a userspace-noticeable change? For example, with this series
> > > >>>>>> a SIGKILL may shut down the datapath ...
> > > >>>>>
> > > >>>>> It already changed in 6.4. We basically added a new interface to shutdown
> > > >>>>> everything (userspace and vhost kernel parts). So we won't just shutdown
> > > >>>>> the data path while userspace is still running. We will shutdown everything
> > > >>>>> now if you send a SIGKILL to a vhost worker's thread.
> > > >>>>
> > > >>>> If I understand correctly, for example Qemu can still live if SIGKILL
> > > >>>> is just sent to the vhost thread.
> > > >>>
> > > >>> Pre-6.4 qemu could still survive if only the vhost thread got a SIGKILL.
> > > >>> We used kthreads which are special and can ignore it like how userspace
> > > >>> can ignore SIGHUP.
> > > >>>
> > > >>> 6.4 and newer kernels cannot survive. Even if the vhost thread sort of
> > > >>> ignores it, as I described below, the signal is still delivered
> > > >>> to the other qemu threads due to the shared signal handler. Userspace
> > > >>> can't ignore SIGKILL. It doesn't have any say in the matter, and the
> > > >>> kernel forces them to exit.
> > > >>
> > > >> Ok, I see, so the reason is that vhost belongs to the same thread
> > > >> group as the owner now.
> > > >>
> > > >>>
> > > >>>>
> > > >>>> If this is correct, guests may detect this (for example virtio-net has
> > > >>>> a watchdog).
> > > >>>>
> > > >>>
> > > >>> What did you mean by that part? Do you mean if the vhost thread were to
> > > >>> exit, so drivers/vhost/net.c couldn't process IO, then the watchdog in
> > > >>> the guest (virtio-net driver in the guest kernel) would detect that?
> > > >>
> > > >> I meant this one. But since we are using CLONE_THREAD, we won't see these.
> > > >>
> > > >>> Or
> > > >>> are you saying the watchdog in the guest can detect signals that the
> > > >>> host gets?
> > > >>>
> > > >>>
> > > >>>>>
> > > >>>>> Here are a lots of details:
> > > >>>>>
> > > >>>>> - Pre-6.4 kernel, when vhost workers used kthreads, if you sent any signal
> > > >>>>> to a vhost worker, we ignore it. Nothing happens. kthreads are special and
> > > >>>>> can ignore all signals.
> > > >>>>>
> > > >>>>> You could think of it as the worker is a completely different process than
> > > >>>>> qemu/userspace so they have completely different signal handlers. The
> > > >>>>> vhost worker signal handler ignores all signals even SIGKILL.
> > > >>>>
> > > >>>> Yes.
> > > >>>>
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a qemu thread, then it just exits right away. We
> > > >>>>> don't get to do an explicit close() on the vhost device and we don't get
> > > >>>>> to do ioctls like VHOST_NET_SET_BACKEND to clear backends. The kernel exit
> > > >>>>> code runs and releases refcounts on the device/file, then the vhost device's
> > > >>>>> file_operations->release function is called. vhost_dev_cleanup then stops
> > > >>>>> the vhost worker.
> > > >>>>
> > > >>>> Right.
> > > >>>>
> > > >>>>>
> > > >>>>> - In 6.4 and newer kernels, vhost workers use vhost_tasks, so the worker
> > > >>>>> can be thought of as a thread within the userspace process. With that
> > > >>>>> change we have the same signal handler as the userspace process.
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a qemu thread then it works like above.
> > > >>>>>
> > > >>>>> If you send a SIGKILL to a vhost worker, the vhost worker still sort of
> > > >>>>> ignores it (that is the hack that I mentioned at the beginning of this
> > > >>>>> thread). kernel/vhost_task.c:vhost_task_fn will see the signal and
> > > >>>>> then just continue to process works until file_operations->release
> > > >>>>> calls
> > > >>>>
> > > >>>> Yes, so this sticks to the behaviour before vhost_tasks.
> > > >>>
> > > >>> Not exactly. The vhost_task stays alive temporarily.
> > > >>>
> > > >>> The signal is still delivered to the userspace threads and they will
> > > >>> exit due to getting the SIGKILL also. SIGKILL goes to all the threads in
> > > >>> the process and all userspace threads exit like normal because the vhost
> > > >>> task and normal old userspace threads share a signal handler. When
> > > >>> userspace exits, the kernel force drops the refcounts on the vhost
> > > >>> devices and that runs the release function so the vhost_task will then exit.
> > > >>>
> > > >>> So what I'm trying to say is that in 6.4 we already changed the behavior.
> > > >>
> > > >> Yes. To tell the truth, it looks even worse, but it might be too late to fix.
> > > >
> > > > Andres (cced) has identified two other possible changes:
> > > >
> > > > 1) doesn't run in the global PID namespace but runs in the namespace of the owner
> > >
> > > Yeah, I mentioned that one in vhost.h like it's a feature and when posting
> > > the patches I mentioned it as a possible fix. I mean I thought we wanted it
> > > to work like qemu and iothreads where the iothread would inherit all those
> > > values automatically.
> >
> > Right, but it could be noticed by userspace, especially by users
> > that try to tweak performance.
> >
> > The root cause is that now we do copy_process() in the context of
> > Qemu instead of kthreadd, which results in differences in namespaces
> > (I think the PID namespace is not the only one where we see a
> > difference) and other attributes for the vhost task.
>
> Leaking things out of a namespace looks more like a bug.
> If you really have to be pedantic, the thing to add would
> be a namespace flag not a qemu flag. Userspace running inside
> a namespace really must have no say about whether to leak
> info out of it.
>
> > >
> > > At the time, I thought we didn't inherit the namespace, like we did the cgroup,
> > > because there was no kernel function for it (like how we didn't inherit v2
> > > cgroups until recently when someone added some code for that).
> > >
> > > I don't know if it's allowed to have something like qemu in namespace N but then
> > > have its children (vhost thread in this case) in the global namespace.
> > > I'll
> > > look into it.
> >
> > Instead of moving vhost thread between difference namespaces, I wonder
> > if the following is simpler:
> >
> > if (new_flag)
> >     vhost_task_create()
> > else
> >     kthread_create()
> >
> > With the new flag, the worker inherits the attributes of Qemu
> > (namespaces, rlimit, cgroup, scheduling attributes, ...), which is
> > what we want. Without the new flag, we stick exactly to the past
> > behaviour to unbreak existing userspace.
> >
> > >
> > > > 2) doesn't inherit kthreadd's scheduling attributes but the owner's
> > >
> > > Same as above for this one. I thought I was fixing a bug where before
> > > we had to manually tune the vhost thread's values but for iothreads they
> > > automatically got set up.
> > >
> > > Just to clarify this one. When we used kthreads, kthread() would reset the
> > > scheduler priority for the kthread that's created, so we got the default
> > > values instead of inheriting kthreadd's values. So we would want:
> > >
> > > +       sched_setscheduler_nocheck(current, SCHED_NORMAL, &param);
> > >
> > > in vhost_task_fn() instead of inheriting kthreadd's values.
> > >
> > > >
> > > > Though such a change makes more sense for some use cases, it may break others.
> > > >
> > > > I wonder if we need to introduce a new flag and bring back the old kthread
> > >
> > > Do you mean something like a module param?
> >
> > This requires the management layer to know if it has a new user space
> > or not, which is hard. A better place would be to introduce backend features.
> >
> > >
> > > > code if the flag is not set? Then we would not end up trying to align
> > > > the behaviour?
> > > >
> > >
> > > Let me know what you guys prefer. The sched part is easy. The namespace
> > > part might be more difficult, but I will look into it if you want it.
> >
> > Thanks a lot. I think it would be better to have the namespace part
> > (as well as the other namespaces); then we don't need to answer hard
> > questions about whether it can break user space or not.
> >
> > >
>





