On Wed, Mar 01, 2023 at 11:14:35AM -0700, Jeffrey Hugo wrote: > On 3/1/2023 10:05 AM, Stanislaw Gruszka wrote: > > On Wed, Mar 01, 2023 at 09:08:03AM -0700, Jeffrey Hugo wrote: > > > > This looks a bit suspicious. Are you sure you can modify > > > > sg->dma_address and still use it as valid value ? > > > > > > A single entry in the sg table is a contiguous mapping of memory. If it > > > wasn't contiguous, it would have to be broken up into multiple entries. In > > > the simple case, a driver is going to take the dma_address/len pair and hand > > > that directly to the device. Then the device is going to access every > > > address in that range. > > > > > > If the device can access every address from dma_address to dma_address + > > > len, why can't it access a subset of that? > > > > Required address alignment can be broken. Not sure if only that. > > AIC100 doesn't have required alignment. AIC100 can access any 64-bit > address, at a byte level granularity. The only restriction AIC100 has is > that the size of a transfer is restricted to a 32-bit value, so max > individual transfer size of 4GB. Transferring more than 4GB requires > multiple transactions. > > > > > > Are you suggesting renaming > > > > > this function? I guess I'm not quite understanding your comment here. Can > > > > > you elaborate? > > > > > > > > Renaming would be nice. I was thinking by simplifying it, not sure > > > > now if that's easy achievable, though. > > > > > > Ok. I'll think on this. > > > > Maybe this function could be removed ? And create sg lists > > that hardware can handle without any modification. > > Just idea to consider, not any requirement. > > Ok, so this is part of our "slicing" operation, and thus required. > > Maybe how slicing works is not clear. > > Lets say that we have a workload on AIC100 that can identify a license plate > in a picture (aka lprnet). Lets assume this workload only needs the RGB > values of a RGBA file (a "jpeg" we are processing). > > Userspace allocates a BO to hold the entire file. A quarter of the file is > R values, a quarter is G values, etc. For simplicity, lets assume the R > values are all sequentially listed, then the G values, then the B values, > finally the A values. When we allocate the BO, we map it once. If we have > an IOMMU, this optimizes the IOMMU mappings. BOs can be quite large. We > have some test workloads based on real world workloads where each BO is > 16-32M in size, and there are multiple BOs. I don't want to map a 32M BO N > duplicate times in the IOMMU. > > So, now userspace slices the BO. It tells us we need to transfer the RGB > values (the first 75% of the BO), but not the A values. So, we create a > copy of the mapped SG and edit it to represent this transfer, which is a > subset of the entire BO. Using the slice information and the mapping > information, we construct the DMA engine commands that can be used to > transfer the relevant portions of the BO to the device. > > It sounds like you are suggesting, lets flip this around. Don't map the > entire BO once. Instead, wait for the slice info from userspace, construct > a sg list based on the parts of the BO for the slice, and map that. Then > the driver gets a mapped SG it can just use. The issue I see with that is > slices can overlap. You can transfer the same part of a BO multiple times. > Maybe lprnet has multiple threads on AIC100 where thread A consumes R data, > thread B consumes R and G data, and thread C consumes B data. We need to > transfer the R data twice to different device locations so that threads A > and B can consume the R data independently. > > If we map per slice, we are going to map the R part of the BO twice in the > IOMMU. Is that valid? It feels possible that there exists some IOMMU > implementation that won't allow multiple IOVAs to map to the same DDR PA > because that is weird and the implementer thinks its a software bug. I > don't want to run into that. Assuming it is valid, that is multiple > mappings in the IOMMU TLB which could have been a single mapping. We are > wasting IOMMU resources. > > There are some ARM systems we support with limited IOVA space in the IOMMU, > and we've had some issues with exhausting that space. The current > implementation is influenced by those experiences. Ok, then the current implementation seems reasonable. Thanks for explanation! Regards Stanislaw