From: David Hildenbrand <david@xxxxxxxxxx>
Sent: Thursday, May 2, 2024 12:17 AM
>
> On 01.05.24 17:14, mhkelley58@xxxxxxxxx wrote:
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> >
> > The Hyper-V balloon driver supports hot-add of memory in addition
> > to ballooning. Current code hot-adds in fixed-size chunks of
> > 128 MiB (fixed constant HA_CHUNK in the code). While this works
> > in Hyper-V VMs with 64 GiB or less of memory, where the Linux
> > memblock size is 128 MiB, the hot-add fails for larger memblock
> > sizes because add_memory() expects memory to be added in chunks
> > that match the memblock size. Messages like the following are
> > reported when Linux has a 256 MiB memblock size:
> >
> > [  312.668859] Block size [0x10000000] unaligned hotplug range:
> >                start 0x310000000, size 0x8000000
> > [  312.668880] hv_balloon: hot_add memory failed error is -22
> > [  312.668984] hv_balloon: Memory hot add failed
> >
> > Larger memblock sizes are usually used in VMs with more than
> > 64 GiB of memory, depending on the alignment of the VM's
> > physical address space.
> >
> > Fix this problem by having the Hyper-V balloon driver determine
> > the Linux memblock size, and process hot-add requests in that
> > chunk size instead of a fixed 128 MiB. Also update the hot-add
> > alignment requested of the Hyper-V host to match the memblock
> > size.
> >
> > The code changes look significant, but in fact are just a
> > simple text substitution of a new global variable for the
> > previous HA_CHUNK constant. No algorithms are changed except
> > to initialize the new global variable and to calculate the
> > alignment value to pass to Hyper-V. Testing with memblock
> > sizes of 256 MiB and 2 GiB shows correct operation.
> >
> > Signed-off-by: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > ---
> > Changes in v2:
> > * Change new global variable name from ha_chunk_pgs to
> >   ha_pages_in_chunk [David Hildenbrand]
> > * Use kernel macros ALIGN(), ALIGN_DOWN(), and umin()
> >   to simplify code and reduce references to HA_CHUNK. For
> >   ease of review, this is done in a new patch preceding
> >   this one. [David Hildenbrand]
> >
> >  drivers/hv/hv_balloon.c | 55 +++++++++++++++++++++++++----------------
> >  1 file changed, 34 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> > index 9f45b8a6762c..e0a1a18041ca 100644
> > --- a/drivers/hv/hv_balloon.c
> > +++ b/drivers/hv/hv_balloon.c
> > @@ -425,11 +425,11 @@ struct dm_info_msg {
> >   * The range start_pfn : end_pfn specifies the range
> >   * that the host has asked us to hot add. The range
> >   * start_pfn : ha_end_pfn specifies the range that we have
> > - * currently hot added. We hot add in multiples of 128M
> > - * chunks; it is possible that we may not be able to bring
> > - * online all the pages in the region. The range
> > + * currently hot added. We hot add in chunks equal to the
> > + * memory block size; it is possible that we may not be able
> > + * to bring online all the pages in the region. The range
> >   * covered_start_pfn:covered_end_pfn defines the pages that can
> > - * be brough online.
> > + * be brought online.
> >   */
> >
> >  struct hv_hotadd_state {
> > @@ -505,8 +505,9 @@ enum hv_dm_state {
> >
> >  static __u8 recv_buffer[HV_HYP_PAGE_SIZE];
> >  static __u8 balloon_up_send_buffer[HV_HYP_PAGE_SIZE];
> > +static unsigned long ha_pages_in_chunk;
> > +
> >  #define PAGES_IN_2M (2 * 1024 * 1024 / PAGE_SIZE)
> > -#define HA_CHUNK (128 * 1024 * 1024 / PAGE_SIZE)
> >
> >  struct hv_dynmem_device {
> >  	struct hv_device *dev;
> > @@ -724,21 +725,21 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
> >  	unsigned long processed_pfn;
> >  	unsigned long total_pfn = pfn_count;
> >
> > -	for (i = 0; i < (size/HA_CHUNK); i++) {
> > -		start_pfn = start + (i * HA_CHUNK);
> > +	for (i = 0; i < (size/ha_pages_in_chunk); i++) {
> > +		start_pfn = start + (i * ha_pages_in_chunk);
> >
> >  		scoped_guard(spinlock_irqsave, &dm_device.ha_lock) {
> > -			has->ha_end_pfn += HA_CHUNK;
> > -			processed_pfn = umin(total_pfn, HA_CHUNK);
> > +			has->ha_end_pfn += ha_pages_in_chunk;
> > +			processed_pfn = umin(total_pfn, ha_pages_in_chunk);
> >  			total_pfn -= processed_pfn;
> > -			has->covered_end_pfn += processed_pfn;
> > +			has->covered_end_pfn += processed_pfn;
> >  		}
> >
> >  		reinit_completion(&dm_device.ol_waitevent);
> >
> >  		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
> >  		ret = add_memory(nid, PFN_PHYS((start_pfn)),
> > -				(HA_CHUNK << PAGE_SHIFT), MHP_MERGE_RESOURCE);
> > +				(ha_pages_in_chunk << PAGE_SHIFT), MHP_MERGE_RESOURCE);
>
> HA_BYTES_IN_CHUNK might be reasonable to have (see below)
>
> >  		if (do_hot_add)
> > @@ -1807,10 +1808,13 @@ static int balloon_connect_vsp(struct hv_device *dev)
> >  	cap_msg.caps.cap_bits.hot_add = hot_add_enabled();
> >
> >  	/*
> > -	 * Specify our alignment requirements as it relates
> > -	 * memory hot-add. Specify 128MB alignment.
> > +	 * Specify our alignment requirements for memory hot-add. The value is
> > +	 * the log base 2 of the number of megabytes in a chunk. For example,
> > +	 * with 256 MiB chunks, the value is 8. The number of MiB in a chunk
> > +	 * must be a power of 2.
> >  	 */
> > -	cap_msg.caps.cap_bits.hot_add_alignment = 7;
> > +	cap_msg.caps.cap_bits.hot_add_alignment =
> > +		ilog2(ha_pages_in_chunk >> (20 - PAGE_SHIFT));
>
> I was wondering if we can remove some of the magic here. Something along
> the lines of:
>
> 	ilog2(ha_pages_in_chunk / (SZ_1M >> PAGE_SHIFT))
>
> or simply
>
> 	#define HA_BYTES_IN_CHUNK (ha_pages_in_chunk << PAGE_SHIFT)
>
> 	ilog2(HA_BYTES_IN_CHUNK / SZ_1M)
>
> Apart from that nothing jumped at me; looks much cleaner.
>
> Reviewed-by: David Hildenbrand <david@xxxxxxxxxx>

David --

I need to respin anyway because I missed a dependency on
CONFIG_MEMORY_HOTPLUG, as pointed out by the kernel test robot. I'll add
your suggestion to that respin.

Michael