Re: [PATCH] virtio_net: Abort driver initialization if device fails

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Aug 29, 2022 at 12:52 PM Eli Cohen <elic@xxxxxxxxxx> wrote:
>
>
>
> > -----Original Message-----
> > From: Jason Wang <jasowang@xxxxxxxxxx>
> > Sent: Monday, August 29, 2022 3:33 AM
> > To: Michael S. Tsirkin <mst@xxxxxxxxxx>
> > Cc: Eli Cohen <elic@xxxxxxxxxx>; eperezma <eperezma@xxxxxxxxxx>; Si-
> > Wei Liu <si-wei.liu@xxxxxxxxxx>; virtualization <virtualization@lists.linux-
> > foundation.org>
> > Subject: Re: [PATCH] virtio_net: Abort driver initialization if device fails
> >
> > On Mon, Aug 29, 2022 at 4:07 AM Michael S. Tsirkin <mst@xxxxxxxxxx>
> > wrote:
> > >
> > > On Sun, Aug 28, 2022 at 06:48:20PM +0300, Eli Cohen wrote:
> > > > Read the status bit after virtio_device_ready() to check if device
> > > > initialization was successful. If it was not, abort driver
> > > > initialization to avoid further attempts to access device resources.
> > > >
> > > > Abort is required per virtio spec v1.1
> > > >
> > > > 3.1.1
> > > > ...
> > > > If any of these steps go irrecoverably wrong, the driver SHOULD set the
> > > > FAILED status bit to indicate that it has given up on the device (it can
> > > > reset the device later to restart if desired). The driver MUST NOT
> > > > continue initialization in that case.
> > >
> > > I don't see a requirement to read the status bit though
> > > which is what the patch does.
> > >
> > > > This fixes an issue that was discovered when mlx5_vdpa initialization
> > > > failed due to firmware error
> > >
> > > A bit more detail would be nice. which function failed exactly?
> > >
> > > > and subsequent attempts to send control VQ
> > > > commands failed with a call trace:
> > > >
> > > >  watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-
> > udevd:2610]
> > > >  Modules linked in: virtio_net(+) net_failover failover virtio_vdpa
> > mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib ib_uverbs ib_core mlx5_core
> > mlxfw psample tls pci>
> > > >   drm ghash_clmulni_intel serio_raw usb_storage scsi_transport_sas
> > hpwdt wmi target_core_mod [last unloaded: ib_core]
> > > >  CPU: 62 PID: 2610 Comm: systemd-udevd Tainted: G I        6.0.0-rc2+ #7
> > > >  Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS
> > U30 04/08/2020
> > > >  RIP: 0010:virtnet_send_command+0xfa/0x140 [virtio_net]
> > > >  Code: ec f0 80 e0 85 c0 0f 88 0c 68 00 00 48 8b 7b 08 e8 9b d6 80 e0 84 c0
> > 75 11 eb 43 48 8b 7b 08 e8 7c c3 80 e0 84 c0 75 15 f3 90 <48> 8b 7b 08 48 8d
> > 74 24 >
> > > >  RSP: 0018:ffffb6fba2037b78 EFLAGS: 00000246
> > > >  RAX: 0000000000000000 RBX: ffff99c6043a99c0 RCX: 0000000000000000
> > > >  RDX: 0000000000000000 RSI: ffffb6fba2037b7c RDI: ffff99c6042c6500
> > > >  RBP: ffffb6fba2037bc0 R08: 0000000000000001 R09: 0000000000000000
> > > >  R10: 0000000000000003 R11: 0000000000000002 R12:
> > 0000000000000002
> > > >  R13: 0000000000000004 R14: 0000000000000000 R15: ffff99f634258400
> > > >  FS:  00007f5894eefb40(0000) GS:ffff99dd9fb80000(0000)
> > knlGS:0000000000000000
> > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >  CR2: 000055acd4a80000 CR3: 00000018605a4002 CR4:
> > 00000000007706e0
> > > >  DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > > >  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > >  PKRU: 55555554
> > > >  Call Trace:
> > > >   <TASK>
> > > >   _virtnet_set_queues+0x7f/0xc0 [virtio_net]
> >
> > Btw the calltrace seems like the driver is waiting for the completion
> > of mq setting command.
> >
>
> Yes, and driver resources may no be available due to the failure.
>
> > There's some recent discussion of using timeout or interrupt for ctrl
> > command which may help to solve this issue.
> >
>
> That could work but why wait for a timeout if you can detect the failure immediately?

If I understand the calltrace, it is just waiting for the response, so
it doesn't know whether it succeeds or not?

        /* Spin for a response, the kick causes an ioport write, trapping
         * into the hypervisor, so the request should be handled immediately.
         */
        while (!virtqueue_get_buf(vi->cvq, &tmp) &&
               !virtqueue_is_broken(vi->cvq))
                cpu_relax();

Thanks

> After initializing seems a good point in time to check if things went right or wrong.
>
> > Thanks
> >
> > > >   virtnet_probe+0x989/0xae0 [virtio_net]
> > > >   virtio_dev_probe+0x1ab/0x260
> > > >   really_probe+0xde/0x390
> > > >   ? pm_runtime_barrier+0x50/0x90
> > > >   __driver_probe_device+0x78/0x180
> > > >   driver_probe_device+0x1e/0x90
> > > >   __driver_attach+0xc4/0x1e0
> > > >   ? __device_attach_driver+0xe0/0xe0
> > > >   ? __device_attach_driver+0xe0/0xe0
> > > >   bus_for_each_dev+0x61/0x90
> > > >   bus_add_driver+0x1a9/0x200
> > > >   driver_register+0x8f/0xf0
> > > >   ? 0xffffffffc0fdb000
> > > >   virtio_net_driver_init+0x70/0x1000 [virtio_net]
> > > >   do_one_initcall+0x41/0x210
> > > >   ? kmem_cache_alloc_trace+0x16d/0x2c0
> > > >   do_init_module+0x4c/0x1f0
> > > >   __do_sys_finit_module+0x9f/0x100
> > > >   do_syscall_64+0x38/0x90
> > > >   entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > >  RIP: 0033:0x7f589510aecd
> > > >  Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48
> > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01
> > >
> > > >  RSP: 002b:00007ffde61fb398 EFLAGS: 00000246 ORIG_RAX:
> > 0000000000000139
> > > >  RAX: ffffffffffffffda RBX: 000055acd48ac950 RCX: 00007f589510aecd
> > > >  RDX: 0000000000000000 RSI: 00007f589584332c RDI:
> > 0000000000000012
> > > >  RBP: 0000000000020000 R08: 0000000000000000 R09:
> > 0000000000000002
> > > >  R10: 0000000000000012 R11: 0000000000000246 R12:
> > 00007f589584332c
> > > >  R13: 000055acd4867c70 R14: 0000000000000007 R15:
> > 000055acd48980a0
> > > >   </TASK>
> > > >
> > > > Fixes: commit 4baf1e33d084 ("virtio_net: enable VQs early")
> > > > Signed-off-by: Eli Cohen <elic@xxxxxxxxxx>
> > > > ---
> > > >  drivers/net/virtio_net.c | 7 ++++++-
> > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 9cce7dec7366..4698d9a28a6f 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -3900,6 +3900,11 @@ static int virtnet_probe(struct virtio_device
> > *vdev)
> > > >       }
> > > >
> > > >       virtio_device_ready(vdev);
> > > > +     if (vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_FAILED) {
> > > > +             err = -EINVAL;
> > > > +             rtnl_unlock();
> > > > +             goto unregister_ndev;
> > > > +     }
> > > >
> > > >       rtnl_unlock();
> > > >
> > >
> > > I don't get it. What set the failed status?
> > >
> > >
> > > > @@ -3934,7 +3939,7 @@ static int virtnet_probe(struct virtio_device
> > *vdev)
> > > >
> > > >  free_unregister_netdev:
> > > >       virtio_reset_device(vdev);
> > > > -
> > > > +unregister_ndev:
> > > >       unregister_netdev(dev);
> > > >  free_failover:
> > > >       net_failover_destroy(vi->failover);
> > > > --
> > > > 2.35.1
> > >
>

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization



[Index of Archives]     [KVM Development]     [Libvirt Development]     [Libvirt Users]     [CentOS Virtualization]     [Netdev]     [Ethernet Bridging]     [Linux Wireless]     [Kernel Newbies]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux