RE: [PATCH 1/2] Drivers: hv: vmbus: Wait for offers during boot

Michael Kelley <mhklinux@xxxxxxxxxxx> · Mon, 21 Oct 2024 04:33:37 +0000

From: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx> Sent: Friday, October 18, 2024 4:58 AM
> 
> Channel offers are requested during vmbus initialization and resume
> from hibernation. Add support to wait for all channel offers to be
> delivered and processed before returning from vmbus_request_offers.
> This is to support user mode (VTL0) in OpenHCL (A Linux based
> paravisor for Confidential VMs) to ensure that all channel offers
> are present when init begins in VTL0, and it knows which channels
> the host has offered and which it has not.
> 
> This is in analogy to a PCI bus not returning from probe until it has
> scanned all devices on the bus.
> 
> Without this, user mode can race with vmbus initialization and miss
> channel offers. User mode has no way to work around this other than
> sleeping for a while, since there is no way to know when vmbus has
> finished processing offers.
> 
> With this added functionality, remove earlier logic which keeps track
> of count of offered channels post resume from hibernation. Once all
> offers delivered message is received, no further offers are going to
> be received. Consequently, logic to prevent suspend from happening
> after previous resume had missing offers, is also removed.
> 
> Co-developed-by: John Starks <jostarks@xxxxxxxxxxxxx>
> Signed-off-by: John Starks <jostarks@xxxxxxxxxxxxx>
> Signed-off-by: Naman Jain <namjain@xxxxxxxxxxxxxxxxxxx>
> ---
>  drivers/hv/channel_mgmt.c | 38 +++++++++++++++++++++++---------------
>  drivers/hv/connection.c   |  4 ++--
>  drivers/hv/hyperv_vmbus.h | 14 +++-----------
>  drivers/hv/vmbus_drv.c    | 16 ----------------
>  4 files changed, 28 insertions(+), 44 deletions(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 3c6011a48dab..ac514805dafe 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -944,16 +944,6 @@ void vmbus_initiate_unload(bool crash)
>  		vmbus_wait_for_unload();
>  }
> 
> -static void check_ready_for_resume_event(void)
> -{
> -	/*
> -	 * If all the old primary channels have been fixed up, then it's safe
> -	 * to resume.
> -	 */
> -	if (atomic_dec_and_test(&vmbus_connection.nr_chan_fixup_on_resume))
> -		complete(&vmbus_connection.ready_for_resume_event);
> -}
> -
>  static void vmbus_setup_channel_state(struct vmbus_channel *channel,
>  				      struct vmbus_channel_offer_channel *offer)
>  {
> @@ -1109,8 +1099,6 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
> 
>  		/* Add the channel back to the array of channels. */
>  		vmbus_channel_map_relid(oldchannel);
> -		check_ready_for_resume_event();
> -
>  		mutex_unlock(&vmbus_connection.channel_mutex);
>  		return;
>  	}
> @@ -1297,12 +1285,11 @@ EXPORT_SYMBOL_GPL(vmbus_hvsock_device_unregister);
>  /*
>   * vmbus_onoffers_delivered -
>   * This is invoked when all offers have been delivered.
> - *
> - * Nothing to do here.
>   */

I'm unclear on the meaning of the ALLOFFERS_DELIVERED message. In the
early days of Hyper-V, the set of virtual devices in a VM was fixed before a
VM started, so presumably ALLOFFERS_DELIVERED meant that offers for
that fixed set of devices had been delivered. That meaning makes sense
conceptually.

But more recent versions of Hyper-V support adding and removing devices
at any time during the life of the VM. If a device is added, a new offer is
generated. Existing devices (such as netvsc) can reconfigure their channels,
in which case new sub-channel offers are generated. So the core concept of
"all offers delivered" isn't valid anymore.

To date Linux hasn't done anything with this message, so we didn't need
precision about what it means. There's no VMBus specification or
documentation, so we need some text here in the comments that
explains precisely what ALLOFFERS_DELIVERED means, and how that
meaning aligns with the fact that offers can be delivered anytime during
the life of the VM.

I'll note that in the past there has also been behavior where, during guest
boot, Hyper-V offers a virtual PCI device to the guest, immediately
rescinds the vPCI device (before Linux has even finished processing the offer),
then offers it again. I wonder how ALLOFFERS_DELIVERED interacts
with that situation.

I ran some tests in an Azure L16s_v3 VM, which has three vPCI devices:
2 NVMe devices and 1 Mellanox CX-5. The offers for the 2 NVMe devices
came before the ALLOFFERS_DELIVERED message, but the offer for
the Mellanox CX-5 came *after* the ALLOFFERS_DELIVERED message.
If that's the way the Mellanox CX-5 offers work, this patch breaks things
in the hibernation resume path because ALLOFFERS_DELIVERED isn't
sufficient to indicate that all primary channels have been re-offered
and the resume can proceed. All sub-channel offers came after the
ALLOFFERS_DELIVERED message, but that is what I expected and
shouldn't be a problem.
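
To make the concern concrete, here is a small user-space model (not kernel
code) of the ordering I saw. It assumes the same ordering would occur on
resume, which I have not verified, and it tracks only the three vPCI
devices. The enum, struct, and function names are made up for illustration;
primaries_pending stands in for the old nr_chan_fixup_on_resume counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical event types, used only by this model. */
enum vmbus_event { PRIMARY_OFFER, SUBCHANNEL_OFFER, ALLOFFERS_DELIVERED };

struct resume_state {
	int primaries_pending;   /* models nr_chan_fixup_on_resume */
	bool ready_by_count;     /* models the old ready_for_resume_event */
	bool ready_by_message;   /* models all_offers_delivered_event */
	int pending_at_message;  /* primaries still missing at the message */
};

/* Replay a sequence of host messages and record when each scheme fires. */
static void replay(struct resume_state *s, const enum vmbus_event *ev,
		   size_t n)
{
	for (size_t i = 0; i < n; i++) {
		switch (ev[i]) {
		case PRIMARY_OFFER:
			/* Old scheme: ready once every primary is re-offered. */
			if (s->primaries_pending > 0 &&
			    --s->primaries_pending == 0)
				s->ready_by_count = true;
			break;
		case ALLOFFERS_DELIVERED:
			/* New scheme: ready as soon as the message arrives. */
			if (!s->ready_by_message) {
				s->ready_by_message = true;
				s->pending_at_message = s->primaries_pending;
			}
			break;
		case SUBCHANNEL_OFFER:
			/* Sub-channels are not counted by either scheme. */
			break;
		}
	}
}
```

Replaying "NVMe offer, NVMe offer, ALLOFFERS_DELIVERED, CX-5 offer" with
three pending primaries leaves pending_at_message at 1: the message-based
scheme would let the resume proceed while one primary channel has not yet
been re-offered.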

It's hard for me to review this patch set without some precision around
what ALLOFFERS_DELIVERED means.

>  static void vmbus_onoffers_delivered(
>  			struct vmbus_channel_message_header *hdr)
>  {
> +	complete(&vmbus_connection.all_offers_delivered_event);
>  }
> 
>  /*
> @@ -1578,7 +1565,8 @@ void vmbus_onmessage(struct vmbus_channel_message_header *hdr)
>  }
> 
>  /*
> - * vmbus_request_offers - Send a request to get all our pending offers.
> + * vmbus_request_offers - Send a request to get all our pending offers
> + * and wait for all offers to arrive.
>   */
>  int vmbus_request_offers(void)
>  {
> @@ -1596,6 +1584,10 @@ int vmbus_request_offers(void)
> 
>  	msg->msgtype = CHANNELMSG_REQUESTOFFERS;
> 
> +	/*
> +	 * This REQUESTOFFERS message will result in the host sending an all
> +	 * offers delivered message.
> +	 */
>  	ret = vmbus_post_msg(msg, sizeof(struct vmbus_channel_message_header),
>  			     true);
> 
> @@ -1607,6 +1599,22 @@ int vmbus_request_offers(void)
>  		goto cleanup;
>  	}
> 
> +	/* Wait for the host to send all offers. */
> +	while (wait_for_completion_timeout(
> +		&vmbus_connection.all_offers_delivered_event, msecs_to_jiffies(10 * 1000)) == 0) {

Maybe use !wait_for_completion_timeout(...) instead of explicitly testing
for 0? I see that some existing code uses the explicit test for 0, but that's
not the usual kernel code idiom.

> +		pr_warn("timed out waiting for all offers to be delivered...\n");
> +	}

The while loop could continue forever. We don't want to trust the Hyper-V
host to be well-behaved, so there should be an overall timeout after which
the code gives up, say 100 seconds (or pick some value). If the overall
timeout is reached, I'm not sure whether the code should panic or just go
on; it's debatable. But doing something explicit is probably better than
repeatedly outputting the warning message forever.
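
As a rough illustration of the shape I have in mind, here is a user-space
sketch (not kernel code, and not a proposed patch). The wait_slice callback
stands in for wait_for_completion_timeout() and follows its convention of
returning 0 on timeout. The constants, the function names, and the fake
host are all hypothetical, and the sketch simply returns false on giving up
since what to do at that point is still open.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical values; 100 seconds is only a placeholder. */
#define OFFER_WAIT_SLICE_MS	(10 * 1000)
#define OFFER_WAIT_TOTAL_MS	(100 * 1000)

/* Stand-in for wait_for_completion_timeout(): returns 0 on timeout. */
typedef unsigned long (*wait_slice_fn)(void *ctx, unsigned int timeout_ms);

/*
 * Wait in slices, warning after each timed-out slice, and give up once
 * the overall budget is spent. Returns true if the event completed.
 */
static bool wait_for_all_offers_bounded(wait_slice_fn wait_slice, void *ctx)
{
	unsigned int waited_ms = 0;

	while (!wait_slice(ctx, OFFER_WAIT_SLICE_MS)) {
		waited_ms += OFFER_WAIT_SLICE_MS;
		if (waited_ms >= OFFER_WAIT_TOTAL_MS) {
			fprintf(stderr, "giving up waiting for all offers\n");
			return false;
		}
		fprintf(stderr,
			"timed out waiting for all offers to be delivered...\n");
	}
	return true;
}

/* Fake host: completes on the Nth call, or never if complete_on_call is 0. */
struct fake_host {
	unsigned int calls;
	unsigned int complete_on_call;
};

static unsigned long fake_wait_slice(void *ctx, unsigned int timeout_ms)
{
	struct fake_host *host = ctx;

	(void)timeout_ms;	/* the fake does not actually sleep */
	host->calls++;
	return host->complete_on_call &&
	       host->calls >= host->complete_on_call;
}
```

With a fake host that never completes, the loop stops after ten slices
instead of warning forever; with one that completes on the third slice, it
returns true after three calls.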

> +
> +	/*
> +	 * Flush handling of offer messages (which may initiate work on
> +	 * other work queues).
> +	 */
> +	flush_workqueue(vmbus_connection.work_queue);
> +
> +	/* Flush processing the incoming offers. */
> +	flush_workqueue(vmbus_connection.handle_primary_chan_wq);
> +	flush_workqueue(vmbus_connection.handle_sub_chan_wq);

Why does the sub-channel workqueue need to be flushed? Sub-channels
get created at the request of the Linux driver, in cooperation with the VSP
on the Hyper-V side. If the top-level goal is for user-space drivers to be
able to know that at least a core set of offers has been processed, it
seems like waiting for sub-channel offers isn't necessary. And new
sub-channel offers could arrive immediately after this flush, so the flush
doesn't provide any guarantee about whether there are offers in that
workqueue. If there is a valid reason to wait, some explanatory
comments would be helpful.

Michael

> +
>  cleanup:
>  	kfree(msginfo);
> 
> diff --git a/drivers/hv/connection.c b/drivers/hv/connection.c
> index f001ae880e1d..8351360bba16 100644
> --- a/drivers/hv/connection.c
> +++ b/drivers/hv/connection.c
> @@ -34,8 +34,8 @@ struct vmbus_connection vmbus_connection = {
> 
>  	.ready_for_suspend_event = COMPLETION_INITIALIZER(
>  				  vmbus_connection.ready_for_suspend_event),
> -	.ready_for_resume_event	= COMPLETION_INITIALIZER(
> -				  vmbus_connection.ready_for_resume_event),
> +	.all_offers_delivered_event = COMPLETION_INITIALIZER(
> +				  vmbus_connection.all_offers_delivered_event),
>  };
>  EXPORT_SYMBOL_GPL(vmbus_connection);
> 
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index d2856023d53c..80cc65dac740 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -287,18 +287,10 @@ struct vmbus_connection {
>  	struct completion ready_for_suspend_event;
> 
>  	/*
> -	 * The number of primary channels that should be "fixed up"
> -	 * upon resume: these channels are re-offered upon resume, and some
> -	 * fields of the channel offers (i.e. child_relid and connection_id)
> -	 * can change, so the old offermsg must be fixed up, before the resume
> -	 * callbacks of the VSC drivers start to further touch the channels.
> +	 * Completed once the host has offered all channels. Note that
> +	 * some channels may still be being processed on a work queue.
>  	 */
> -	atomic_t nr_chan_fixup_on_resume;
> -	/*
> -	 * vmbus_bus_resume() waits for "nr_chan_fixup_on_resume" to
> -	 * drop to zero.
> -	 */
> -	struct completion ready_for_resume_event;
> +	struct completion all_offers_delivered_event;
>  };
> 
> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 9b15f7daf505..bd3fc41dc06b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2427,11 +2427,6 @@ static int vmbus_bus_suspend(struct device *dev)
>  	if (atomic_read(&vmbus_connection.nr_chan_close_on_suspend) > 0)
>  		wait_for_completion(&vmbus_connection.ready_for_suspend_event);
> 
> -	if (atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) != 0) {
> -		pr_err("Can not suspend due to a previous failed resuming\n");
> -		return -EBUSY;
> -	}
> -
>  	mutex_lock(&vmbus_connection.channel_mutex);
> 
>  	list_for_each_entry(channel, &vmbus_connection.chn_list, listentry) {
> @@ -2456,17 +2451,12 @@ static int vmbus_bus_suspend(struct device *dev)
>  			pr_err("Sub-channel not deleted!\n");
>  			WARN_ON_ONCE(1);
>  		}
> -
> -		atomic_inc(&vmbus_connection.nr_chan_fixup_on_resume);
>  	}
> 
>  	mutex_unlock(&vmbus_connection.channel_mutex);
> 
>  	vmbus_initiate_unload(false);
> 
> -	/* Reset the event for the next resume. */
> -	reinit_completion(&vmbus_connection.ready_for_resume_event);
> -
>  	return 0;
>  }
> 
> @@ -2502,14 +2492,8 @@ static int vmbus_bus_resume(struct device *dev)
>  	if (ret != 0)
>  		return ret;
> 
> -	WARN_ON(atomic_read(&vmbus_connection.nr_chan_fixup_on_resume) == 0);
> -
>  	vmbus_request_offers();
> 
> -	if (wait_for_completion_timeout(
> -		&vmbus_connection.ready_for_resume_event, 10 * HZ) == 0)
> -		pr_err("Some vmbus device is missing after suspending?\n");
> -
>  	/* Reset the event for the next suspend. */
>  	reinit_completion(&vmbus_connection.ready_for_suspend_event);
> 
> --
> 2.34.1
>