On Sat, Dec 9, 2023 at 6:42 PM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
>
> On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> > On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > > >
> > > > > > I dug a little deeper into what the vhost is doing. I'm not an expert on
> > > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > > the terminology.
> > > > > >
> > > > > > - vhost is looping through available queues.
> > > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > > - kworker does something with that queue and terminates quickly.
> > > > > >
> > > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > > it will just keep on looping until it is migrated off the CPU.
> > > > >
> > > > > Normally it takes the buffers off the queue and is done with it.
> > > > > I am guessing that at the same time the guest is running on some other
> > > > > CPU and keeps adding available buffers?
> > > >
> > > > It seems to do just that; there are multiple other vhost instances
> > > > involved which might keep filling up those queues.
> > >
> > > No vhost is ever only draining queues. Guest is filling them.
> > >
> > > > Unfortunately, this makes the problematic vhost instance stay on
> > > > the CPU and prevents said kworker from getting scheduled. The kworker is
> > > > explicitly woken up by vhost, so it wants it to do something.

It looks to me like vhost doesn't use a workqueue but runs the worker by itself.

> > > > At this point it seems that there is an assumption about the scheduler
> > > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > > far, it seems like EEVDF does what it is intended to do.
> > > >
> > > > Shouldn't there be a more explicit mechanism in use that allows the
> > > > kworker to be scheduled in favor of the vhost?

Vhost does a bunch of copy_from_user() calls, which should trigger __might_fault() and so __might_sleep() in most cases.

> > > > It is also concerning that the vhost seemingly cannot be preempted by the
> > > > scheduler while executing that loop.
> > >
> > > Which loop is that, exactly?
> >
> > The loop continuously passes through translate_desc in drivers/vhost/vhost.c.
> > That's where I put the trace statements.
> >
> > The overall sequence seems to be (top to bottom):
> >
> > handle_rx
> > get_rx_bufs
> > vhost_get_vq_desc
> > vhost_get_avail_head
> > vhost_get_avail
> > __vhost_get_user_slow
> > translate_desc          << trace statement in here
> > vhost_iotlb_itree_first
>
> I wonder why you keep missing the cache and re-translating.
> Is pr_debug enabled for you? If not, could you check if it
> outputs anything?
> Or you can tweak:
>
> #define vq_err(vq, fmt, ...) do {                                  \
>                 pr_debug(pr_fmt(fmt), ##__VA_ARGS__);              \
>                 if ((vq)->error_ctx)                               \
>                         eventfd_signal((vq)->error_ctx, 1);        \
>         } while (0)
>
> to do pr_err if you prefer.
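
(For clarity, I read that tweak as simply swapping pr_debug for pr_err in the macro above, i.e. something like the below; untested:)

#define vq_err(vq, fmt, ...) do {                                  \
                pr_err(pr_fmt(fmt), ##__VA_ARGS__);                \
                if ((vq)->error_ctx)                               \
                        eventfd_signal((vq)->error_ctx, 1);        \
        } while (0)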
> > These functions show up as having increased overhead in perf.
> >
> > There are multiple loops going on in there.
> > Again the disclaimer though, I'm not familiar with that code at all.
>
> So there's a limit there: vhost_exceeds_weight should requeue work:
>
> } while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));
>
> then we invoke the scheduler each time before re-executing it:
>
> {
>         struct vhost_worker *worker = data;
>         struct vhost_work *work, *work_next;
>         struct llist_node *node;
>
>         node = llist_del_all(&worker->work_list);
>         if (node) {
>                 __set_current_state(TASK_RUNNING);
>
>                 node = llist_reverse_order(node);
>                 /* make sure flag is seen after deletion */
>                 smp_wmb();
>                 llist_for_each_entry_safe(work, work_next, node, node) {
>                         clear_bit(VHOST_WORK_QUEUED, &work->flags);
>                         kcov_remote_start_common(worker->kcov_handle);
>                         work->fn(work);
>                         kcov_remote_stop();
>                         cond_resched();
>                 }
>         }
>
>         return !!node;
> }
>
> These are the byte and packet limits:
>
> /* Max number of bytes transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others. */
> #define VHOST_NET_WEIGHT 0x80000
>
> /* Max number of packets transferred before requeueing the job.
>  * Using this limit prevents one virtqueue from starving others with small
>  * pkts.
>  */
> #define VHOST_NET_PKT_WEIGHT 256
>
> Try reducing the VHOST_NET_WEIGHT limit and see if that improves things any?

Or a dirty hack like cond_resched() in translate_desc().

Thanks

>
> --
> MST
>
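
P.S. By "dirty hack" I mean something roughly like the below. This is an untested sketch written from memory, so the exact signature and surrounding code of translate_desc() in drivers/vhost/vhost.c may differ:

static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
                          struct iovec iov[], int iov_size, int access)
{
        ...
        /* dirty hack: yield here so the starved kworker gets a chance
         * to run before we keep looping over the queues
         */
        cond_resched();

        while ((u64)len > s) {
                ...
        }
        ...
}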