From: Jacob Pan <jacob.pan@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, January 2, 2025 1:40 PM > > When device memory blocks are hot-added via add_memory_driver_managed(), > the HV balloon driver intercepts them but fails to online these memory > blocks. This could result in device driver initialization failures. > > To address this, fall back to the generic online callback for memory > blocks not added by the HV balloon driver. Similar cases are handled the > same way in virtio-mem driver. I had a little bit of trouble figuring out what problem this patch is solving. Is this a correct description? The Hyper-V balloon driver installs a custom callback for handling page onlining operations performed by the memory hotplug subsystem. This custom callback is global, and overrides the default callback (generic_online_page) that Linux otherwise uses. The custom callback properly handles memory that is hot-added by the balloon driver as part of a Hyper-V hot-add region. But memory can also be hot-added directly by a device driver for a vPCI device, particularly GPUs. In such a case, the custom callback installed by the balloon driver runs, but won't find the page in its hot-add region list and doesn't online it, which could cause driver initialization failures. Fix this by having the balloon custom callback run generic_online_page() when the page isn't part of a Hyper-V hot-add region, thereby doing the default Linux behavior. This allows device driver hot-adds to work properly. Similar cases are handled the same way in the virtio-mem driver. > > Suggested-by: Vikram Sethi <vsethi@xxxxxxxxxx> > Tested-by: Michael Frohlich <mfrohlich@xxxxxxxxxxxxx> > Signed-off-by: Jacob Pan <jacob.pan@xxxxxxxxxxxxxxxxxxx> > --- > drivers/hv/hv_balloon.c | 18 ++++++++++-------- > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c > index a99112e6f0b8..c999daf34108 100644 > --- a/drivers/hv/hv_balloon.c > +++ b/drivers/hv/hv_balloon.c > @@ -766,16 +766,18 @@ static void hv_online_page(struct page *pg, unsigned int > order) > struct hv_hotadd_state *has; > unsigned long pfn = page_to_pfn(pg); > > - guard(spinlock_irqsave)(&dm_device.ha_lock); > - list_for_each_entry(has, &dm_device.ha_region_list, list) { > - /* The page belongs to a different HAS. */ > - if (pfn < has->start_pfn || > - (pfn + (1UL << order) > has->end_pfn)) > - continue; > + scoped_guard(spinlock_irqsave, &dm_device.ha_lock) { > + list_for_each_entry(has, &dm_device.ha_region_list, list) { > + /* The page belongs to a different HAS. */ > + if (pfn < has->start_pfn || > + (pfn + (1UL << order) > has->end_pfn)) > + continue; > > - hv_bring_pgs_online(has, pfn, 1UL << order); > - break; > + hv_bring_pgs_online(has, pfn, 1UL << order); > + return; > + } > } > + generic_online_page(pg, order); > } > > static int pfn_covered(unsigned long start_pfn, unsigned long pfn_cnt) > -- > 2.34.1 > The code LGTM. I'd suggest updating the commit message along the lines of what I wrote. My descriptions can be a bit wordy, but hopefully they add clarity on the overall picture. In any case, Reviewed-by: Michael Kelley <mhklinux@xxxxxxxxxxx>