Re: [PATCH v2 5/8] accel/qaic: Add datapath

Stanislaw Gruszka <stanislaw.gruszka@xxxxxxxxxxxxxxx> · Fri, 3 Mar 2023 14:49:32 +0100

On Wed, Mar 01, 2023 at 11:14:35AM -0700, Jeffrey Hugo wrote:
> On 3/1/2023 10:05 AM, Stanislaw Gruszka wrote:
> > On Wed, Mar 01, 2023 at 09:08:03AM -0700, Jeffrey Hugo wrote:
> > > > This looks a bit suspicious. Are you sure you can modify
> > > > sg->dma_address and still use it as valid value ?
> > > 
> > > A single entry in the sg table is a contiguous mapping of memory.  If it
> > > wasn't contiguous, it would have to be broken up into multiple entries.  In
> > > the simple case, a driver is going to take the dma_address/len pair and hand
> > > that directly to the device.  Then the device is going to access every
> > > address in that range.
> > > 
> > > If the device can access every address from dma_address to dma_address +
> > > len, why can't it access a subset of that?
> > 
> > Required address alignment can be broken. Not sure if only that.
> 
> AIC100 doesn't have required alignment.  AIC100 can access any 64-bit
> address, at a byte level granularity.  The only restriction AIC100 has is
> that the size of a transfer is restricted to a 32-bit value, so max
> individual transfer size of 4GB.  Transferring more than 4GB requires
> multiple transactions.
> 
> > > > > Are you suggesting renaming
> > > > > this function?  I guess I'm not quite understanding your comment here. Can
> > > > > you elaborate?
> > > > 
> > > > Renaming would be nice. I was thinking by simplifying it, not sure
> > > > now if that's easy achievable, though.
> > > 
> > > Ok.  I'll think on this.
> > 
> > Maybe this function could be removed ? And create sg lists
> > that hardware can handle without any modification.
> > Just idea to consider, not any requirement.
> 
> Ok, so this is part of our "slicing" operation, and thus required.
> 
> Maybe how slicing works is not clear.
> 
> Lets say that we have a workload on AIC100 that can identify a license plate
> in a picture (aka lprnet).  Lets assume this workload only needs the RGB
> values of a RGBA file (a "jpeg" we are processing).
> 
> Userspace allocates a BO to hold the entire file.  A quarter of the file is
> R values, a quarter is G values, etc.  For simplicity, lets assume the R
> values are all sequentially listed, then the G values, then the B values,
> finally the A values.  When we allocate the BO, we map it once.  If we have
> an IOMMU, this optimizes the IOMMU mappings.  BOs can be quite large.  We
> have some test workloads based on real world workloads where each BO is
> 16-32M in size, and there are multiple BOs.  I don't want to map a 32M BO N
> duplicate times in the IOMMU.
> 
> So, now userspace slices the BO.  It tells us we need to transfer the RGB
> values (the first 75% of the BO), but not the A values.  So, we create a
> copy of the mapped SG and edit it to represent this transfer, which is a
> subset of the entire BO.  Using the slice information and the mapping
> information, we construct the DMA engine commands that can be used to
> transfer the relevant portions of the BO to the device.
> 
> It sounds like you are suggesting, lets flip this around.  Don't map the
> entire BO once.  Instead, wait for the slice info from userspace, construct
> a sg list based on the parts of the BO for the slice, and map that.  Then
> the driver gets a mapped SG it can just use.  The issue I see with that is
> slices can overlap.  You can transfer the same part of a BO multiple times.
> Maybe lprnet has multiple threads on AIC100 where thread A consumes R data,
> thread B consumes R and G data, and thread C consumes B data.  We need to
> transfer the R data twice to different device locations so that threads A
> and B can consume the R data independently.
> 
> If we map per slice, we are going to map the R part of the BO twice in the
> IOMMU.  Is that valid?  It feels possible that there exists some IOMMU
> implementation that won't allow multiple IOVAs to map to the same DDR PA
> because that is weird and the implementer thinks its a software bug.  I
> don't want to run into that.  Assuming it is valid, that is multiple
> mappings in the IOMMU TLB which could have been a single mapping.  We are
> wasting IOMMU resources.
> 
> There are some ARM systems we support with limited IOVA space in the IOMMU,
> and we've had some issues with exhausting that space.  The current
> implementation is influenced by those experiences.

Ok, then the current implementation seems reasonable.
Thanks for explanation!

Regards
Stanislaw