Re: [PATCH] [RFC] virtio: Limit the retries on a virtio device reset

Pierre Morel <pmorel@xxxxxxxxxxxxxxxxxx> · Fri, 25 Aug 2017 13:21:06 +0200

On 25/08/2017 10:26, Cornelia Huck wrote:
On Fri, 25 Aug 2017 00:16:05 +0300
"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:

On Thu, Aug 24, 2017 at 07:07:42PM +0200, Pierre Morel wrote:
- we'll have to spread these tests all over the place.

I counted 19 places where to check if the reset went OK.

None of them touch the device anymore after reset and just free driver's
resources.

... and then hypervisor uses the resources after free. Not good.

hum... yes, no good.
we can not assume anything about the host side at that time anymore.

The only place where we can simply give up on the device and be sure
that nothing bad happens is during initial setup. In the other places,
it seems we have the choice between looping (as now) or panic.

Yes
Note also that the probability that an initial reset works but further 
reset don't is very small.
I think we catch most of the problem during the probe.

IMHO, looping if not in initial setup let the administrator more freedom.
OTOH panic reset to a known state

Problem open: how to recognize an initial setup:
- just make reset return an error and let the driver take the decision
- find a way to let the reset function that this is an initial setup
	- add a state in virtio_device?
	- other?

If we add a state to the virtio_device we could also add other 
information as the reset_retries_count.

So that if reset failed, nothing goes wrong, no device access, but the
probability that the next probe fail is high. (If it ever succeed).

    Allowing reset to fail would be better.

May be I did not understand what you mean.
Testing the flag or a return value is as expensive.

Of course the implementation is a mater of taste.

If a function can fail it should return an error, not just set a flag.

You are right in this case (at least), setting the flag is bad, since 
the flag is set by iowrite8(), it is handled on the device side... 
untrusted in the case we explore.

ccw reset can (a) succeed, (b) fail (by returning an error status via
standard channel subsystem mechanisms), or (c) run into a timeout
(where the driver should recover via csch and friends). [1] In any
case, we need to able to rely on the channel subsystem to make sure a
broken device is dead.

pci-modern reset can (a) succeed, or (b) be in an indeterminate state
(it has not yet completed, but we don't know whether it will complete
in the future). It may be reasonable to just give up if (b) happens
during initial setup.

I'm not sure that allowing to fail reset will help much, other than
allowing to give up on a pci device early.
We have nowhere a sanity check to verify that the virtio transport is in 
order.
The first successful reset can be seen as such a check, independently 
from the virtio transport type.
Give up the device early if the transport is not in order is indeed the 
only thing we can do.

Regards,

Pierre

I notice two other things to do:

- May be adding a warning would be fine too.
- Virtio_ccw may add a fail flag when allocation of CCW failed.
   I did not find anything to do for virtio_mmio or legacy virtio_pci.

Regards,

Pierre

[1] At that point, we either succeed with csch and can either retry or
fail, or fail with device gone, in which case the device is, well,
gone, which we also should be able to handle fine.

--
Pierre Morel
Linux/KVM/QEMU in Böblingen - Germany

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization