On 3/1/2023 10:05 AM, Stanislaw Gruszka wrote:
On Wed, Mar 01, 2023 at 09:08:03AM -0700, Jeffrey Hugo wrote:
This looks a bit suspicious. Are you sure you can modify
sg->dma_address and still use it as valid value ?
A single entry in the sg table is a contiguous mapping of memory. If it
wasn't contiguous, it would have to be broken up into multiple entries. In
the simple case, a driver is going to take the dma_address/len pair and hand
that directly to the device. Then the device is going to access every
address in that range.
If the device can access every address from dma_address to dma_address +
len, why can't it access a subset of that?
Required address alignment can be broken. Not sure if only that.
AIC100 doesn't have required alignment. AIC100 can access any 64-bit
address, at a byte level granularity. The only restriction AIC100 has
is that the size of a transfer is restricted to a 32-bit value, so max
individual transfer size of 4GB. Transferring more than 4GB requires
multiple transactions.
Are you suggesting renaming
this function? I guess I'm not quite understanding your comment here. Can
you elaborate?
Renaming would be nice. I was thinking by simplifying it, not sure
now if that's easy achievable, though.
Ok. I'll think on this.
Maybe this function could be removed ? And create sg lists
that hardware can handle without any modification.
Just idea to consider, not any requirement.
Ok, so this is part of our "slicing" operation, and thus required.
Maybe how slicing works is not clear.
Lets say that we have a workload on AIC100 that can identify a license
plate in a picture (aka lprnet). Lets assume this workload only needs
the RGB values of a RGBA file (a "jpeg" we are processing).
Userspace allocates a BO to hold the entire file. A quarter of the file
is R values, a quarter is G values, etc. For simplicity, lets assume
the R values are all sequentially listed, then the G values, then the B
values, finally the A values. When we allocate the BO, we map it once.
If we have an IOMMU, this optimizes the IOMMU mappings. BOs can be
quite large. We have some test workloads based on real world workloads
where each BO is 16-32M in size, and there are multiple BOs. I don't
want to map a 32M BO N duplicate times in the IOMMU.
So, now userspace slices the BO. It tells us we need to transfer the
RGB values (the first 75% of the BO), but not the A values. So, we
create a copy of the mapped SG and edit it to represent this transfer,
which is a subset of the entire BO. Using the slice information and the
mapping information, we construct the DMA engine commands that can be
used to transfer the relevant portions of the BO to the device.
It sounds like you are suggesting, lets flip this around. Don't map the
entire BO once. Instead, wait for the slice info from userspace,
construct a sg list based on the parts of the BO for the slice, and map
that. Then the driver gets a mapped SG it can just use. The issue I
see with that is slices can overlap. You can transfer the same part of
a BO multiple times. Maybe lprnet has multiple threads on AIC100 where
thread A consumes R data, thread B consumes R and G data, and thread C
consumes B data. We need to transfer the R data twice to different
device locations so that threads A and B can consume the R data
independently.
If we map per slice, we are going to map the R part of the BO twice in
the IOMMU. Is that valid? It feels possible that there exists some
IOMMU implementation that won't allow multiple IOVAs to map to the same
DDR PA because that is weird and the implementer thinks its a software
bug. I don't want to run into that. Assuming it is valid, that is
multiple mappings in the IOMMU TLB which could have been a single
mapping. We are wasting IOMMU resources.
There are some ARM systems we support with limited IOVA space in the
IOMMU, and we've had some issues with exhausting that space. The
current implementation is influenced by those experiences.
I appreciate the outsider perspective (here and elsewhere).
Re-evaluating some of these things is resulting in improvements.
-Jeff