On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@xxxxxxxxx> wrote:
>
> On 1/7/21 9:42 AM, Claire Chang wrote:
>
> >> Can you explain how ATF gets involved and to what extent it does help,
> >> besides enforcing a secure region from the ARM CPU's perspective? Does
> >> the PCIe root complex not have an IOMMU but can somehow be denied access
> >> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> >> still some sort of basic protection that the HW enforces, right?
> >
> > We need the ATF support for the memory MPU (memory protection unit).
> > Restricted DMA (with reserved-memory in dts) makes sure the predefined
> > memory region is used for PCIe DMA only, but we still need the MPU to
> > lock down PCIe access to that specific region.
>
> OK so you do have a protection unit of some sort to enforce which region
> in DRAM the PCIe bridge is allowed to access, that makes sense,
> otherwise the restricted DMA region would only be a hint but nothing you
> can really enforce. This is almost entirely analogous to our systems then.

Here is an example of setting the MPU:
https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132

>
> There may be some value in standardizing on an ARM SMCCC call then since
> you already support two different SoC vendors.
>
> >> On Broadcom STB SoCs we have had something similar for a while, however:
> >> while we don't have an IOMMU for the PCIe bridge, we do have a basic
> >> protection mechanism whereby we can configure a region in DRAM to be
> >> PCIe read/write and CPU read/write, which then gets used as the PCIe
> >> inbound region for the PCIe EP. By default the PCIe bridge is not
> >> allowed access to DRAM, so we must call into a security agent to allow
> >> the PCIe bridge to access the designated DRAM region.
> >>
> >> We have done this using a private CMA area region assigned via Device
> >> Tree, and requiring the PCIe EP driver to use
> >> dma_alloc_from_contiguous() in order to allocate from this device
> >> private CMA area. The only drawback with that approach is that it
> >> requires knowing how much memory you need up front for buffers and DMA
> >> descriptors that the PCIe EP will need to process. The problem is that
> >> it requires driver modifications, and that does not scale over the
> >> number of PCIe EP drivers, some of which we absolutely do not control,
> >> but there is no need to bounce buffer. Your approach scales better
> >> across PCIe EP drivers, however it does require bounce buffering, which
> >> could be a performance hit.
> >
> > Only the streaming DMA (map/unmap) needs bounce buffering.
>
> True, and typically only on transmit since you don't really control
> where the sk_buffs are allocated from, right? On RX, since you need to
> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
> them from a pool that already falls within the restricted DMA region,
> right?
>

Right, but applying bounce buffering to RX will make it more secure: the
device won't be able to modify the content after unmap, just like what
iommu_unmap() does.
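To make that concrete, here is a minimal sketch of the driver-side RX path
(my_wifi_rx_buf() and its arguments are made up for illustration; only the
standard streaming DMA API calls are real). With restricted DMA the bouncing
happens entirely inside dma_map_single()/dma_unmap_single(), so no endpoint
driver changes are needed:

#include <linux/dma-mapping.h>

/*
 * Illustrative only: my_wifi_rx_buf() is a made-up helper; the DMA API
 * calls are the standard streaming interface, and with restricted DMA
 * the map/unmap below is where the bouncing happens, transparently to
 * the driver.
 */
static int my_wifi_rx_buf(struct device *dev, void *buf, size_t len)
{
        dma_addr_t dma;

        /*
         * Returns an address inside the restricted pool; that is what
         * gets handed to the device, never the address of 'buf' itself.
         */
        dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma))
                return -ENOMEM;

        /* ... program 'dma' into the RX ring; the device DMAs into the pool ... */

        /*
         * Copies the received data from the bounce slot back into 'buf'
         * and releases the slot: after this point the device can no
         * longer modify what the driver sees, much like after iommu_unmap().
         */
        dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
        return 0;
}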
> > I also added alloc/free support in this series
> > (https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc()
> > will try to allocate memory from the predefined memory region.
> >
> > As for the performance hit, it should be similar to the default swiotlb.
> > Here are my experiment results. Both SoCs lack an IOMMU for PCIe.
> >
> > PCIe wifi vht80 throughput (Mbps) -
> >
> >   MTK SoC              tcp_tx   tcp_rx   udp_tx   udp_rx
> >   w/o Restricted DMA   244.1    134.66   312.56   350.79
> >   w/  Restricted DMA   246.95   136.59   363.21   351.99
> >
> >   Rockchip SoC         tcp_tx   tcp_rx   udp_tx   udp_rx
> >   w/o Restricted DMA   237.87   133.86   288.28   361.88
> >   w/  Restricted DMA   256.01   130.95   292.28   353.19
>
> How come you get better throughput with restricted DMA? Is it because
> doing DMA to/from a contiguous region allows for better grouping of
> transactions from the DRAM controller's perspective somehow?

I'm not sure, but enabling the default swiotlb for wifi also helps the
throughput a little bit for me.

> > The CPU usage doesn't increase too much either. Although I didn't
> > measure the CPU usage very precisely, it's ~3% with a single big core
> > (Cortex-A72) and ~5% with a single small core (Cortex-A53).
> >
> > Thanks!
>
> >>
> >> Thanks!
> >> --
> >> Florian
>
> --
> Florian
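And to be clear about the alloc/free support quoted above: coherent
allocations get served from the same reserved pool by dma_direct_alloc(),
so a driver's existing dma_alloc_coherent() calls also stay unchanged. A
minimal sketch (the helper names are made up; only the DMA API calls are
real):

#include <linux/dma-mapping.h>

/*
 * Illustrative helpers only: with the alloc/free patch, the coherent
 * buffer below comes out of the device's restricted region instead of
 * the normal page allocator, with no change to the call sites.
 */
static void *my_alloc_dma_ring(struct device *dev, size_t size,
                               dma_addr_t *dma)
{
        /* Allocated from (and stays inside) the reserved region. */
        return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
}

static void my_free_dma_ring(struct device *dev, size_t size, void *vaddr,
                             dma_addr_t dma)
{
        dma_free_coherent(dev, size, vaddr, dma);
}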