> From: Michael Kelley (EOSG) > Sent: Monday, May 28, 2018 17:19 > > While this patch solves the immediate problem of getting hung waiting > for a response from Hyper-V that will never come, there's another scenario > to look at that I think introduces a race. Suppose the guest VM issues a > vmbus_sendpacket() request in one of the cases covered by this patch, > and suppose that Hyper-V queues a response to the request, and then > immediately follows with a rescind request. Processing the response will > get queued to a tasklet associated with the channel, while processing the > rescind will get queued to a tasklet associated with the top-level vmbus > connection. From what I can see, the code doesn't impose any ordering > on processing the two. If the rescind is processed first, the new > wait_for_response() function may wake up, notice the rescind flag, and > return an error. Its caller will return an error, and in doing so pop the > completion packet off the stack. When the response is processed later, > it will try to signal completion via a completion packet that no longer > exists, and memory corruption will likely occur. > > Am I missing anything that would prevent this scenario from happening? > It is admittedly low probability, and a solution seems non-trivial. I haven't > looked specifically, but a similar scenario is probably possible with the > drivers for other VMbus devices. We should work on a generic solution. > > Michael Thanks for spotting the race! IMO we can disable the per-channel tasklet to exclude the race: --- a/drivers/pci/host/pci-hyperv.c +++ b/drivers/pci/host/pci-hyperv.c @@ -565,6 +565,7 @@ static int wait_for_response(struct hv_device *hdev, { while (true) { if (hdev->channel->rescind) { + tasklet_disable(&hdev->channel->callback_event); dev_warn_once(&hdev->device, "The device is gone.\n"); return -ENODEV; } This way, when we exit the loop, we're sure hv_pci_onchannelcallback() can not run anymore. What do you think of this? It looks the list of the other vmbus devices that can be hot-removed is: the hv_utils devices hv_sock devices storvsc device netvsc device As I checked, the first 3 types of devices don't have this "send a request to the host and wait for the response forever" pattern. NetVSC should be fixed as it has the same pattern. -- Dexuan _______________________________________________ devel mailing list devel@xxxxxxxxxxxxxxxxxxxxxx http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel