On Fri, Dec 08, 2023 at 12:41:38PM +0100, Tobias Huschle wrote:
> On Fri, Dec 08, 2023 at 05:31:18AM -0500, Michael S. Tsirkin wrote:
> > On Fri, Dec 08, 2023 at 10:24:16AM +0100, Tobias Huschle wrote:
> > > On Thu, Dec 07, 2023 at 01:48:40AM -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 07, 2023 at 07:22:12AM +0100, Tobias Huschle wrote:
> > > > > 3. vhost looping endlessly, waiting for kworker to be scheduled
> > > > >
> > > > > I dug a little deeper on what the vhost is doing. I'm not an expert on
> > > > > virtio whatsoever, so these are just educated guesses that maybe
> > > > > someone can verify/correct. Please bear with me probably messing up
> > > > > the terminology.
> > > > >
> > > > > - vhost is looping through available queues.
> > > > > - vhost wants to wake up a kworker to process a found queue.
> > > > > - kworker does something with that queue and terminates quickly.
> > > > >
> > > > > What I found by throwing in some very noisy trace statements was that,
> > > > > if the kworker is not woken up, the vhost just keeps looping across
> > > > > all available queues (and seems to repeat itself). So it essentially
> > > > > relies on the scheduler to schedule the kworker fast enough. Otherwise
> > > > > it will just keep on looping until it is migrated off the CPU.
> > > >
> > > > Normally it takes the buffers off the queue and is done with it.
> > > > I am guessing that at the same time the guest is running on some other
> > > > CPU and keeps adding available buffers?
> > > >
> > >
> > > It seems to do just that; there are multiple other vhost instances
> > > involved which might keep filling up those queues.
> > >
> >
> > No, vhost is only ever draining queues. Guest is filling them.
> >
> > > Unfortunately, this makes the problematic vhost instance stay on
> > > the CPU and prevents said kworker from getting scheduled. The kworker is
> > > explicitly woken up by vhost, so it wants it to do something.
> > >
> > > At this point it seems that there is an assumption about the scheduler
> > > in place which is no longer fulfilled by EEVDF. From the discussion so
> > > far, it seems like EEVDF does what it is intended to do.
> > >
> > > Shouldn't there be a more explicit mechanism in use that allows the
> > > kworker to be scheduled in favor of the vhost?
> > >
> > > It is also concerning that the vhost seemingly cannot be preempted by the
> > > scheduler while executing that loop.
> >
> > Which loop is that, exactly?
>
> The loop continuously passes translate_desc in drivers/vhost/vhost.c
> That's where I put the trace statements.
>
> The overall sequence seems to be (top to bottom):
>
> handle_rx
>   get_rx_bufs
>     vhost_get_vq_desc
>       vhost_get_avail_head
>         vhost_get_avail
>           __vhost_get_user_slow
>             translate_desc           << trace statement in here
>               vhost_iotlb_itree_first

I wonder why you keep missing the cache and re-translating.
Is pr_debug enabled for you? If not, could you check if it outputs anything?

Or you can tweak:

#define vq_err(vq, fmt, ...) do {                                  \
		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);              \
		if ((vq)->error_ctx)                               \
				eventfd_signal((vq)->error_ctx, 1);\
	} while (0)

to do pr_err if you prefer (a rough version of that tweak is sketched
below the quoted bit).

> These functions show up as having increased overhead in perf.
>
> There are multiple loops going on in there.
> Again the disclaimer though, I'm not familiar with that code at all.
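Roughly like this, say (untested sketch, just swapping pr_debug for
pr_err in the same macro so the messages show up without dynamic debug):

#define vq_err(vq, fmt, ...) do {                                  \
		pr_err(pr_fmt(fmt), ##__VA_ARGS__);                \
		if ((vq)->error_ctx)                               \
				eventfd_signal((vq)->error_ctx, 1);\
	} while (0)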
So there's a limit there: vhost_exceeds_weight should requeue work:

	} while (likely(!vhost_exceeds_weight(vq, ++recv_pkts, total_len)));

then we invoke the scheduler each time before re-executing it:

{
	struct vhost_worker *worker = data;
	struct vhost_work *work, *work_next;
	struct llist_node *node;

	node = llist_del_all(&worker->work_list);
	if (node) {
		__set_current_state(TASK_RUNNING);

		node = llist_reverse_order(node);
		/* make sure flag is seen after deletion */
		smp_wmb();
		llist_for_each_entry_safe(work, work_next, node, node) {
			clear_bit(VHOST_WORK_QUEUED, &work->flags);
			kcov_remote_start_common(worker->kcov_handle);
			work->fn(work);
			kcov_remote_stop();
			cond_resched();
		}
	}

	return !!node;
}

These are the byte and packet limits:

/* Max number of bytes transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

/* Max number of packets transferred before requeueing the job.
 * Using this limit prevents one virtqueue from starving others with small
 * pkts. */
#define VHOST_NET_PKT_WEIGHT 256

Try reducing the VHOST_NET_WEIGHT limit and see if that improves things
any? A trivial tweak along those lines is sketched below.

--
MST
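For example, in drivers/vhost/net.c (untested; 0x20000 is an arbitrary
smaller value, picked only so the worker requeues and hits
cond_resched() more often):

-#define VHOST_NET_WEIGHT 0x80000
+#define VHOST_NET_WEIGHT 0x20000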