From: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> Sent: Tuesday, May 16, 2023 2:12 AM
>
> Michael Kelley <mikelley@xxxxxxxxxxxxx> writes:
>
> > vmbus_wait_for_unload() may be called in the panic path after other
> > CPUs are stopped. vmbus_wait_for_unload() currently loops through
> > online CPUs looking for the UNLOAD response message. But the values of
> > CONFIG_KEXEC_CORE and crash_kexec_post_notifiers affect the path used
> > to stop the other CPUs, and in one of the paths the stopped CPUs
> > are removed from cpu_online_mask. This removal happens in both
> > x86/x64 and arm64 architectures. In such a case, vmbus_wait_for_unload()
> > only checks the panic'ing CPU, and misses the UNLOAD response message
> > except when the panic'ing CPU is CPU 0. vmbus_wait_for_unload()
> > eventually times out, but only after waiting 100 seconds.
> >
> > Fix this by looping through *present* CPUs in vmbus_wait_for_unload().
> > The cpu_present_mask is not modified by stopping the other CPUs in the
> > panic path, nor should it be. Furthermore, the synic_message_page
> > being checked in vmbus_wait_for_unload() is allocated in
> > hv_synic_alloc() for all present CPUs. So looping through the
> > present CPUs is more consistent.
> >
> > For additional safety, also add a check for the message_page being
> > NULL before looking for the UNLOAD response message.
> >
> > Reported-by: John Starks <jostarks@xxxxxxxxxxxxx>
> > Fixes: cd95aad55793 ("Drivers: hv: vmbus: handle various crash scenarios")
>
> I see you Cc:ed stable@ on the patch, should we also add
>
> Cc: stable@xxxxxxxxxxxxxxx
>
> here explicitly so it gets picked up by various stable backporting
> scripts? I guess Wei can do it when picking the patch to the queue...

Yes, the kernel test robot has already warned me about not doing that
right. :-(

>
> > Signed-off-by: Michael Kelley <mikelley@xxxxxxxxxxxxx>
> > ---
> >  drivers/hv/channel_mgmt.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> > index 007f26d..df2ba20 100644
> > --- a/drivers/hv/channel_mgmt.c
> > +++ b/drivers/hv/channel_mgmt.c
> > @@ -829,11 +829,14 @@ static void vmbus_wait_for_unload(void)
> >  	if (completion_done(&vmbus_connection.unload_event))
> >  		goto completed;
> >
> > -	for_each_online_cpu(cpu) {
> > +	for_each_present_cpu(cpu) {
> >  		struct hv_per_cpu_context *hv_cpu
> >  			= per_cpu_ptr(hv_context.cpu_context, cpu);
> >
> >  		page_addr = hv_cpu->synic_message_page;
> > +		if (!page_addr)
> > +			continue;
> > +
>
> In theory, synic_message_page for all present CPUs is permanently
> assigned in hv_synic_alloc(), and we fail the whole thing if any of
> these allocations fail, so page_addr == NULL is likely impossible
> today. But there's certainly no harm in having this extra check here;
> this is not a hotpath.

But consider a CoCo VM, where the allocation is not done in
hv_synic_alloc(). In that case, synic_message_page is set in
hv_synic_enable_regs(), which is called only when a CPU is brought
online. If fewer CPUs are brought online than are present because of
kernel command line options, the synic_message_page values for the
remaining present CPUs are never initialized and stay NULL. I should
probably tweak the commit message to call out this case explicitly.
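
Just to make that scenario concrete, here is a small standalone
userspace sketch (illustration only, not kernel code; the CPU layout,
names, and the "UNLOAD_RESPONSE" marker are all made up). It models how
a scan limited to online CPUs misses the UNLOAD response once the
stopped CPUs drop out of the online mask, and why a present but
never-onlined CPU with a NULL message page has to be skipped:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NR_FAKE_CPUS 4

/* Hypothetical stand-in for the per-CPU state that matters here. */
struct fake_cpu {
	bool present;
	bool online;
	const char *message_page;	/* NULL if never initialized */
};

/*
 * CPU 0 is the panic'ing CPU (no UNLOAD response in its page),
 * CPU 1 was stopped in the panic path (dropped from the online mask)
 * and holds the response, CPU 2 is present but was never brought
 * online (the CoCo case) so its page is still NULL, and CPU 3 is
 * not present at all.
 */
static struct fake_cpu cpus[NR_FAKE_CPUS] = {
	{ .present = true,  .online = true,  .message_page = NULL },
	{ .present = true,  .online = false, .message_page = "UNLOAD_RESPONSE" },
	{ .present = true,  .online = false, .message_page = NULL },
	{ .present = false, .online = false, .message_page = NULL },
};

/* Scan either the online set or the present set for the response. */
static bool find_unload_response(bool scan_present)
{
	for (int cpu = 0; cpu < NR_FAKE_CPUS; cpu++) {
		struct fake_cpu *c = &cpus[cpu];

		if (scan_present ? !c->present : !c->online)
			continue;
		if (!c->message_page)	/* the NULL check from the patch */
			continue;
		if (!strcmp(c->message_page, "UNLOAD_RESPONSE"))
			return true;
	}
	return false;
}

int main(void)
{
	/* Prints 0 for the online scan, 1 for the present scan. */
	printf("online scan finds the response:  %d\n", find_unload_response(false));
	printf("present scan finds the response: %d\n", find_unload_response(true));
	return 0;
}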
>
> >  		msg = (struct hv_message *)page_addr
> >  			+ VMBUS_MESSAGE_SINT;
> >
> > @@ -867,11 +870,14 @@ static void vmbus_wait_for_unload(void)
> >  	 * maybe-pending messages on all CPUs to be able to receive new
> >  	 * messages after we reconnect.
> >  	 */
> > -	for_each_online_cpu(cpu) {
> > +	for_each_present_cpu(cpu) {
> >  		struct hv_per_cpu_context *hv_cpu
> >  			= per_cpu_ptr(hv_context.cpu_context, cpu);
> >
> >  		page_addr = hv_cpu->synic_message_page;
> > +		if (!page_addr)
> > +			continue;
> > +
> >  		msg = (struct hv_message *)page_addr + VMBUS_MESSAGE_SINT;
> >  		msg->header.message_type = HVMSG_NONE;
> >  	}
>
> Reviewed-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
>

Thanks for reviewing!