From: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
Sent: Tuesday, June 11, 2024 10:44 AM
>
> > -----Original Message-----
> > From: Michael Kelley <mhklinux@xxxxxxxxxxx>
> > Sent: Tuesday, June 11, 2024 12:35 PM
> > To: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; linux-hyperv@xxxxxxxxxxxxxxx;
> > netdev@xxxxxxxxxxxxxxx
> > Cc: Dexuan Cui <decui@xxxxxxxxxxxxx>; stephen@xxxxxxxxxxxxxxxxxx; KY
> > Srinivasan <kys@xxxxxxxxxxxxx>; Paul Rosswurm <paulros@xxxxxxxxxxxxx>;
> > olaf@xxxxxxxxx; vkuznets <vkuznets@xxxxxxxxxx>; davem@xxxxxxxxxxxxx;
> > wei.liu@xxxxxxxxxx; edumazet@xxxxxxxxxx; kuba@xxxxxxxxxx;
> > pabeni@xxxxxxxxxx; leon@xxxxxxxxxx; Long Li <longli@xxxxxxxxxxxxx>;
> > ssengar@xxxxxxxxxxxxxxxxxxx; linux-rdma@xxxxxxxxxxxxxxx;
> > daniel@xxxxxxxxxxxxx; john.fastabend@xxxxxxxxx; bpf@xxxxxxxxxxxxxxx;
> > ast@xxxxxxxxxx; hawk@xxxxxxxxxx; tglx@xxxxxxxxxxxxx;
> > shradhagupta@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > Subject: RE: [PATCH net-next] net: mana: Add support for variable page
> > sizes of ARM64
> >
> > From: LKML haiyangz <lkmlhyz@xxxxxxxxxxxxx> On Behalf Of Haiyang Zhang
> > Sent: Monday, June 10, 2024 2:23 PM
> > >
> > > As defined by the MANA Hardware spec, the queue size for DMA is 4KB
> > > minimal, and power of 2.
> >
> > You say the hardware requires 4K "minimal". But the definitions in this
> > patch hardcode to 4K, as if that's the only choice. Is the hardcoding to
> > 4K a design decision made to simplify the MANA driver?
>
> The HWC q size has to be exactly 4k, which is by HW design.
> Other "regular" queues can be 2^n >= 4k.
>
> > > To support variable page sizes (4KB, 16KB, 64KB) of ARM64, define
> >
> > A minor nit, but "variable" page size doesn't seem like quite the right
> > description -- both here and in the Subject line. On ARM64, the page size
> > is a choice among a few fixed options. Perhaps call it support for "page
> > sizes other than 4K"?
>
> "page sizes other than 4K" sounds good.
>
> > > the minimal queue size as a macro separate from the PAGE_SIZE, which
> > > we always assumed it to be 4KB before supporting ARM64.
> > > Also, update the relevant code related to size alignment, DMA region
> > > calculations, etc.
> > >
> > > Signed-off-by: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
> > > ---
> > >  drivers/net/ethernet/microsoft/Kconfig        |  2 +-
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   |  8 +++----
> > >  .../net/ethernet/microsoft/mana/hw_channel.c  | 22 +++++++++----------
> > >  drivers/net/ethernet/microsoft/mana/mana_en.c |  8 +++----
> > >  .../net/ethernet/microsoft/mana/shm_channel.c |  9 ++++----
> > >  include/net/mana/gdma.h                       |  7 +++++-
> > >  include/net/mana/mana.h                       |  3 ++-
> > >  7 files changed, 33 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/microsoft/Kconfig
> > > b/drivers/net/ethernet/microsoft/Kconfig
> > > index 286f0d5697a1..901fbffbf718 100644
> > > --- a/drivers/net/ethernet/microsoft/Kconfig
> > > +++ b/drivers/net/ethernet/microsoft/Kconfig
> > > @@ -18,7 +18,7 @@ if NET_VENDOR_MICROSOFT
> > >  config MICROSOFT_MANA
> > >  	tristate "Microsoft Azure Network Adapter (MANA) support"
> > >  	depends on PCI_MSI
> > > -	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)
> > > +	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
> > >  	depends on PCI_HYPERV
> > >  	select AUXILIARY_BUS
> > >  	select PAGE_POOL
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 1332db9a08eb..c9df942d0d02 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context *gc, unsigned int length,
> > >  	dma_addr_t dma_handle;
> > >  	void *buf;
> > >
> > > -	if (length < PAGE_SIZE || !is_power_of_2(length))
> > > +	if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> > >  		return -EINVAL;
> > >
> > >  	gmi->dev = gc->dev;
> > > @@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region, NET_MANA);
> > >  static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > >  				     struct gdma_mem_info *gmi)
> > >  {
> > > -	unsigned int num_page = gmi->length / PAGE_SIZE;
> > > +	unsigned int num_page = gmi->length / MANA_MIN_QSIZE;
> >
> > This calculation seems a bit weird when using MANA_MIN_QSIZE. The
> > number of pages, and the construction of the page_addr_list array
> > a few lines later, seem unrelated to the concept of a minimum queue
> > size. Is the right concept really a "mapping chunk", and num_page
> > would conceptually be "num_chunks", or something like that? Then
> > a queue must be at least one chunk in size, but that's derived from the
> > chunk size, and is not the core concept.
>
> I think calling it "num_chunks" is fine.
> May I use "num_chunks" in the next version?

I think the first decision is what to use for MANA_MIN_QSIZE. I'm
admittedly not familiar with mana and gdma, but the function
mana_gd_create_dma_region() seems fairly typical in defining a logical
region that spans multiple 4K chunks that may or may not be physically
contiguous. So you set up an array of physical addresses that identify
the physical memory location of each chunk. It seems very similar to a
Hyper-V GPADL. I typically *do* see such chunks called "pages", but a
"mapping chunk" or "mapping unit" is probably OK too. So
mana_gd_create_dma_region() would use MANA_CHUNK_SIZE instead of
MANA_MIN_QSIZE. Then you could also define MANA_MIN_QSIZE to be
MANA_CHUNK_SIZE, for use specifically when checking the size of a
queue. Then if you are using MANA_CHUNK_SIZE, the local variable would
be "num_chunks".
The use of the "page_count", "page_addr_list", and "offset_in_page"
field names in struct gdma_create_dma_region_req should then be changed
as well. Looking further at the function mana_gd_create_dma_region(),
there's also the use of offset_in_page(), which is based on the guest
PAGE_SIZE. Wouldn't this be problematic if PAGE_SIZE is 64K? But
perhaps Paul would weigh in further with his thoughts.

> > Another approach might be to just call it "MANA_PAGE_SIZE", like
> > has been done with HV_HYP_PAGE_SIZE. HV_HYP_PAGE_SIZE exists to
> > handle exactly the same issue of the guest PAGE_SIZE potentially
> > being different from the fixed 4K size that must be used in host-guest
> > communication on Hyper-V. Same thing here with MANA.
>
> I actually called it "MANA_PAGE_SIZE" in my previous internal patch.
> But Paul from the Hostnet team opposed using that name, because
> 4KB is the min q size. MANA doesn't have "page" at the HW level.
>
> > >  	struct gdma_create_dma_region_req *req = NULL;
> > >  	struct gdma_create_dma_region_resp resp = {};
> > >  	struct gdma_context *gc = gd->gdma_context;
> > > @@ -727,7 +727,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > >  	int err;
> > >  	int i;
> > >
> > > -	if (length < PAGE_SIZE || !is_power_of_2(length))
> > > +	if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> > >  		return -EINVAL;
> > >
> > >  	if (offset_in_page(gmi->virt_addr) != 0)
> > > @@ -751,7 +751,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > >  	req->page_addr_list_len = num_page;
> > >
> > >  	for (i = 0; i < num_page; i++)
> > > -		req->page_addr_list[i] = gmi->dma_handle + i * PAGE_SIZE;
> > > +		req->page_addr_list[i] = gmi->dma_handle + i * MANA_MIN_QSIZE;
> > >
> > >  	err = mana_gd_send_request(gc, req_msg_size, req, sizeof(resp), &resp);
> > >  	if (err)
> > > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > index bbc4f9e16c98..038dc31e09cd 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > @@ -362,12 +362,12 @@ static int mana_hwc_create_cq(struct hw_channel_context *hwc, u16 q_depth,
> > >  	int err;
> > >
> > >  	eq_size = roundup_pow_of_two(GDMA_EQE_SIZE * q_depth);
> > > -	if (eq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > > -		eq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > > +	if (eq_size < MANA_MIN_QSIZE)
> > > +		eq_size = MANA_MIN_QSIZE;
> > >
> > >  	cq_size = roundup_pow_of_two(GDMA_CQE_SIZE * q_depth);
> > > -	if (cq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > > -		cq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > > +	if (cq_size < MANA_MIN_QSIZE)
> > > +		cq_size = MANA_MIN_QSIZE;
> > >
> > >  	hwc_cq = kzalloc(sizeof(*hwc_cq), GFP_KERNEL);
> > >  	if (!hwc_cq)
> > > @@ -429,7 +429,7 @@ static int mana_hwc_alloc_dma_buf(struct hw_channel_context *hwc, u16 q_depth,
> > >
> > >  	dma_buf->num_reqs = q_depth;
> > >
> > > -	buf_size = PAGE_ALIGN(q_depth * max_msg_size);
> > > +	buf_size = MANA_MIN_QALIGN(q_depth * max_msg_size);
> > >
> > >  	gmi = &dma_buf->mem_info;
> > >  	err = mana_gd_alloc_memory(gc, buf_size, gmi);
> > > @@ -497,8 +497,8 @@ static int mana_hwc_create_wq(struct hw_channel_context *hwc,
> > >  	else
> > >  		queue_size = roundup_pow_of_two(GDMA_MAX_SQE_SIZE * q_depth);
> > >
> > > -	if (queue_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > > -		queue_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > > +	if (queue_size < MANA_MIN_QSIZE)
> > > +		queue_size = MANA_MIN_QSIZE;
> > >
> > >  	hwc_wq = kzalloc(sizeof(*hwc_wq), GFP_KERNEL);
> > >  	if (!hwc_wq)
> > > @@ -628,10 +628,10 @@ static int mana_hwc_establish_channel(struct gdma_context *gc, u16 *q_depth,
> > >  	init_completion(&hwc->hwc_init_eqe_comp);
> > >
> > >  	err = mana_smc_setup_hwc(&gc->shm_channel, false,
> > > -				 eq->mem_info.dma_handle,
> > > -				 cq->mem_info.dma_handle,
> > > -				 rq->mem_info.dma_handle,
> > > -				 sq->mem_info.dma_handle,
> > > +				 virt_to_phys(eq->mem_info.virt_addr),
> > > +				 virt_to_phys(cq->mem_info.virt_addr),
> > > +				 virt_to_phys(rq->mem_info.virt_addr),
> > > +				 virt_to_phys(sq->mem_info.virt_addr),
> >
> > This change seems unrelated to handling guest PAGE_SIZE values
> > other than 4K. Does it belong in a separate patch? Or maybe it just
> > needs an explanation in the commit message of this patch?
>
> I know dma_handle is usually just the phys addr. But this is not always
> true if an IOMMU is used...
> I have no problem putting it in a separate patch if desired.

Yes, that would seem like a separate patch. It's not related to
handling page sizes other than 4K.

> > >  				 eq->eq.msix_index);
> > >  	if (err)
> > >  		return err;
> > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > index d087cf954f75..6a891dbce686 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > @@ -1889,10 +1889,10 @@ static int mana_create_txq(struct mana_port_context *apc,
> > >  	 * to prevent overflow.
> > >  	 */
> > >  	txq_size = MAX_SEND_BUFFERS_PER_QUEUE * 32;
> > > -	BUILD_BUG_ON(!PAGE_ALIGNED(txq_size));
> > > +	BUILD_BUG_ON(!MANA_MIN_QALIGNED(txq_size));
> > >
> > >  	cq_size = MAX_SEND_BUFFERS_PER_QUEUE * COMP_ENTRY_SIZE;
> > > -	cq_size = PAGE_ALIGN(cq_size);
> > > +	cq_size = MANA_MIN_QALIGN(cq_size);
> > >
> > >  	gc = gd->gdma_context;
> > >
> > > @@ -2189,8 +2189,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
> > >  	if (err)
> > >  		goto out;
> > >
> > > -	rq_size = PAGE_ALIGN(rq_size);
> > > -	cq_size = PAGE_ALIGN(cq_size);
> > > +	rq_size = MANA_MIN_QALIGN(rq_size);
> > > +	cq_size = MANA_MIN_QALIGN(cq_size);
> > >
> > >  	/* Create RQ */
> > >  	memset(&spec, 0, sizeof(spec));
> > > diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > index 5553af9c8085..9a54a163d8d1 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > @@ -6,6 +6,7 @@
> > >  #include <linux/io.h>
> > >  #include <linux/mm.h>
> > >
> > > +#include <net/mana/gdma.h>
> > >  #include <net/mana/shm_channel.h>
> > >
> > >  #define PAGE_FRAME_L48_WIDTH_BYTES 6
> > > @@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
> > >
> > >  	/* EQ addr: low 48 bits of frame address */
> > >  	shmem = (u64 *)ptr;
> > > -	frame_addr = PHYS_PFN(eq_addr);
> > > +	frame_addr = MANA_PFN(eq_addr);
> > >  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> >
> > In mana_smc_setup_hwc() a few lines above this change, code using
> > PAGE_ALIGNED() is unchanged. Is it correct that the eq/cq/rq/sq
> > addresses must be aligned to 64K if PAGE_SIZE is 64K?
>
> Since we are still using PHYS_PFN on them, if not aligned to PAGE_SIZE,
> the lower bits may be lost. (You said the same below.)
>
> > Related, I wonder about how MANA_PFN() is defined. If PAGE_SIZE is 64K,
> > MANA_PFN() will first right-shift 16, then left-shift 4. The net is a
> > right-shift of 12, corresponding to the 4K chunks that MANA expects. But
> > that approach guarantees that the rightmost 4 bits of the MANA PFN will
> > always be zero. That's consistent with requiring the addresses to be
> > PAGE_ALIGNED() to 64K, but I'm unclear whether that is really the
> > requirement. You might compare with the definition of HVPFN_DOWN(),
> > which has a similar goal for Linux guests communicating with Hyper-V.
>
> @Paul Rosswurm You said MANA HW has "no page concept". So the "frame_addr"
> in mana_smc_setup_hwc() is NOT related to a physical page number, correct?
> Can we just use phys_addr >> 12, like below?
>
> #define MANA_MIN_QSHIFT 12
> #define MANA_PFN(a) ((a) >> MANA_MIN_QSHIFT)
>
> 	/* EQ addr: low 48 bits of frame address */
> 	shmem = (u64 *)ptr;
> -	frame_addr = PHYS_PFN(eq_addr);
> +	frame_addr = MANA_PFN(eq_addr);
>
> > > @@ -191,7 +192,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
> > >
> > >  	/* CQ addr: low 48 bits of frame address */
> > >  	shmem = (u64 *)ptr;
> > > -	frame_addr = PHYS_PFN(cq_addr);
> > > +	frame_addr = MANA_PFN(cq_addr);
> > >  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > > @@ -199,7 +200,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
> > >
> > >  	/* RQ addr: low 48 bits of frame address */
> > >  	shmem = (u64 *)ptr;
> > > -	frame_addr = PHYS_PFN(rq_addr);
> > > +	frame_addr = MANA_PFN(rq_addr);
> > >  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > > @@ -207,7 +208,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
> > >
> > >  	/* SQ addr: low 48 bits of frame address */
> > >  	shmem = (u64 *)ptr;
> > > -	frame_addr = PHYS_PFN(sq_addr);
> > > +	frame_addr = MANA_PFN(sq_addr);
> > >  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > > diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> > > index 27684135bb4d..b392559c33e9 100644
> > > --- a/include/net/mana/gdma.h
> > > +++ b/include/net/mana/gdma.h
> > > @@ -224,7 +224,12 @@ struct gdma_dev {
> > >  	struct auxiliary_device *adev;
> > >  };
> > >
> > > -#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
> > > +/* These are defined by HW */
> > > +#define MANA_MIN_QSHIFT 12
> > > +#define MANA_MIN_QSIZE (1 << MANA_MIN_QSHIFT)
> > > +#define MANA_MIN_QALIGN(x) ALIGN((x), MANA_MIN_QSIZE)
> > > +#define MANA_MIN_QALIGNED(addr) IS_ALIGNED((unsigned long)(addr), MANA_MIN_QSIZE)
> > > +#define MANA_PFN(a) (PHYS_PFN(a) << (PAGE_SHIFT - MANA_MIN_QSHIFT))
> >
> > See comments above about how this is defined.
>
> Replied above.
> Thank you for all the detailed comments!
>
> - Haiyang