Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




在 2024/1/26 11:11, Zhu Yanjun 写道:


在 2024/1/22 15:02, Xuan Zhuo 写道:
On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang<jasowang@xxxxxxxxxx>  wrote:
On Mon, Jan 22, 2024 at 2:55 PM Jason Wang<jasowang@xxxxxxxxxx>  wrote:
On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo<xuanzhuo@xxxxxxxxxxxxxxxxx>  wrote:
On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang<jasowang@xxxxxxxxxx>  wrote:
On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo<xuanzhuo@xxxxxxxxxxxxxxxxx>  wrote:
On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang<jasowang@xxxxxxxxxx>  wrote:
On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun<yanjun.zhu@xxxxxxxxx>  wrote:
在 2024/1/20 1:29, Andrew Lunn 写道:
        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
-           !virtqueue_is_broken(vi->cvq))
+           !virtqueue_is_broken(vi->cvq)) {
+        if (timeout)
+            timeout--;
This is not really a timeout, just a loop counter. 200 iterations could
be a very short time on reasonable H/W. I guess this avoid the soft
lockup, but possibly (likely?) breaks the functionality when we need to
loop for some non negligible time.

I fear we need a more complex solution, as mentioned by Micheal in the
thread you quoted.
Got it. I also look forward to the more complex solution to this problem.
Can we add a device capability (new feature bit) such as ctrq_wait_timeout
to get a reasonable timeout?
The usual solution to this is include/linux/iopoll.h. If you can sleep
read_poll_timeout() otherwise read_poll_timeout_atomic().
I read carefully the functions read_poll_timeout() and
read_poll_timeout_atomic(). The timeout is set by the caller of the 2
functions.
FYI, in order to avoid a swtich of atomic or not, we need convert rx
mode setting to workqueue first:

https://www.mail-archive.com/virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx/msg60298.html

As such, can we add a module parameter to customize this timeout value
by the user?
Who is the "user" here, or how can the "user" know the value?

Or this timeout value is stored in device register, virtio_net driver
will read this timeout value at initialization?
See another thread. The design needs to be general, or you can post a RFC.

In another thought, we've already had a tx watchdog, maybe we can have
something similar to cvq and use timeout + reset in that case.
But we may block by the reset ^_^ if the device is broken?
I mean vq reset here.
I see.

I mean when the deivce is broken, the vq reset also many be blocked.

         void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
         {
                 struct virtio_pci_modern_common_cfg __iomem *cfg;

                 cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;

                 vp_iowrite16(index, &cfg->cfg.queue_select);
                 vp_iowrite16(1, &cfg->queue_reset);

                 while (vp_ioread16(&cfg->queue_reset))
                         msleep(1);

                 while (vp_ioread16(&cfg->cfg.queue_enable))
                         msleep(1);
         }
         EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);

In this function, for the broken device, we can not expect something.
Yes, it's best effort, there's no guarantee then. But it doesn't harm to try.

Thanks

It looks like we have multiple goals here

1) avoid lockups, using workqueue + cond_resched() seems to be
sufficient, it has issue but nothing new
2) recover from the unresponsive device, the issue for timeout is that
it needs to deal with false positives
I agree.

But I want to add a new goal, cvq async. In the netdim, we will
send many requests via the cvq, so the cvq async will be nice.
Then you need an interrupt for cvq.

FYI, I've posted a series that use interrupt for cvq in the past:

https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368621@xxxxxxxxxx/t/
I know this. But the interrupt maybe not a good solution without new space.

Haven't found time in working on this anymore, maybe we can start from
this or not.
I said async, but my aim is to put many requests to the cvq before getting the
response.

Heng Qi posted thishttps://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@xxxxxxxxxxxxxxxxx/

Sorry. This mail is rejected by netdev maillist. So I have to resend it.


Thanks a lot. I read Heng Qi's commits carefully. This patch series are similiar with the NIC feature xmit_more.

But if cvq command is urgent, can we let this urgent cvq command be passed ASAP?

I mean, can we set a flag similar to xmit_more? if cvq command is not urgent, it can be queued. If it is urgent,

this cvq command is passed ASAP.

Zhu Yanjun

Zhu Yanjun

Thanks.


Thanks

Thanks.


Thanks

Thanks.


Thans

Zhu Yanjun

       Andrew




[Index of Archives]     [KVM Development]     [Libvirt Development]     [Libvirt Users]     [CentOS Virtualization]     [Netdev]     [Ethernet Bridging]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux