Re: [PATCH 3/3] tile pci: enable IOMMU to support DMA for legacy devices

On Fri, Jul 13, 2012 at 11:52:11AM -0400, Chris Metcalf wrote:
> Sorry for the slow reply to your feedback; I had to coordinate with our
> primary PCI developer (in another timezone) and we both had various
> unrelated fires to fight along the way.
> 
> I've appended the patch that corrects all the issues you reported. Bjorn,
> I'm assuming that it's appropriate for me to push this change through the
> tile tree (along with all the infrastructural changes to support the
> TILE-Gx TRIO shim that implements PCIe for our chip) rather than breaking
> it out to push it through the pci tree; does that sound correct to you?
> 
> On 6/22/2012 7:24 AM, Bjorn Helgaas wrote:
> > On Fri, Jun 15, 2012 at 1:23 PM, Chris Metcalf <cmetcalf@xxxxxxxxxx> wrote:
> >> This change uses the TRIO IOMMU to map the PCI DMA space and physical
> >> memory at different addresses.  We also now use the dma_mapping_ops
> >> to provide support for non-PCI DMA, PCIe DMA (64-bit) and legacy PCI
> >> DMA (32-bit).  We use the kernel's software I/O TLB framework
> >> (i.e. bounce buffers) for the legacy 32-bit PCI device support since
> >> there are a limited number of TLB entries in the IOMMU and it is
> >> non-trivial to handle indexing, searching, matching, etc.  For 32-bit
> >> devices the performance impact of bounce buffers should not be a concern.
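> >>
> >> Roughly, the per-device dispatch amounts to (a simplified sketch;
> >> names approximate, not the literal patch code):
> >>
> >>   if (pdev->dma_mask <= DMA_BIT_MASK(32))
> >>           /* legacy device: bounce through the software I/O TLB */
> >>           set_dma_ops(&pdev->dev, &gx_legacy_pci_dma_map_ops);
> >>   else
> >>           /* 64-bit device: map through the TRIO IOMMU window */
> >>           set_dma_ops(&pdev->dev, &gx_pci_dma_map_ops);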
> >>
> >>
> >> +extern void
> >> +pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
> >> +                       struct resource *res);
> >> +
> >> +extern void
> >> +pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
> >> +                       struct pci_bus_region *region);
> > These extern declarations look like leftovers that shouldn't be needed.
> 
> Thanks. Removed.
> 
> >> +/* PCI I/O space support is not implemented. */
> >> +static struct resource pci_ioport_resource = {
> >> +       .name   = "PCI IO",
> >> +       .start  = 0,
> >> +       .end    = 0,
> >> +       .flags  = IORESOURCE_IO,
> >> +};
> > You don't need to define pci_ioport_resource at all if you don't
> > support I/O space.
> 
> We have some internal changes to support I/O space, but for now I've gone
> ahead and removed pci_ioport_resource.
> 
> >> +               /*
> >> +                * The PCI memory resource is located above the PA space.
> >> +                * The memory range for the PCI root bus should not overlap
> >> +                * with the physical RAM
> >> +                */
> >> +               pci_add_resource_offset(&resources, &iomem_resource,
> >> +                                       1ULL << CHIP_PA_WIDTH());
> > This says that your entire physical address space (currently
> > 0x0-0xffffffff_ffffffff) is routed to the PCI bus, which is not true.
> > I think what you want here is pci_iomem_resource, but I'm not sure
> > that's set up correctly.  It should contain the CPU physical address
> > that are routed to the PCI bus.  Since you mention an offset, the PCI
> > bus addresses will be "CPU physical address - offset".
> 
> Yes, we've changed it to use pci_iomem_resource. On TILE-Gx, there are two
> types of CPU physical addresses: physical RAM addresses and MMIO addresses.
> The MMIO address has the MMIO attribute in the page table. So, the physical
> address spaces for the RAM and the PCI are completely separate. Instead, we
> have the following relationship: PCI bus address = PCI resource address -
> offset, where the PCI resource addresses are defined by pci_iomem_resource
> and they are never generated by the CPU.

Does that mean the MMIO addresses are not accessible when the CPU
is in physical mode, and you can only reach them via a virtual address
mapped with the MMIO attribute?  If so, then I guess you're basically
combining RAM addresses and MMIO addresses into iomem_resource by
using high "address bits" to represent the MMIO attribute?

> > I don't understand the CHIP_PA_WIDTH() usage -- that seems to be the
> > physical address width, but you define TILE_PCI_MEM_END as "((1ULL <<
> > CHIP_PA_WIDTH()) + TILE_PCI_BAR_WINDOW_TOP)", which would mean the CPU
> > could never generate that address.
> 
> Exactly. The CPU-generated physical addresses for the PCI space, i.e. the
> MMIO addresses, have an address format that is defined by the RC
> controller. They go to the RC controller directly, because the page table
> entry also encodes the RC controller’s location on the chip.
> 
> > I might understand this better if you could give a concrete example of
> > the CPU address range and the corresponding PCI bus address range.
> > For example, I have a box where CPU physical address range [mem
> > 0xf0000000000-0xf007edfffff] is routed to PCI bus address range
> > [0x80000000-0xfedfffff].  In this case, the struct resource contains
> > 0xf0000000000-0xf007edfffff, and the offset is 0xf0000000000 -
> > 0x80000000 or 0xeff80000000.
> 
> The TILE-Gx chip’s CHIP_PA_WIDTH is 40 bits. In the following example, the
> system has 32GB RAM installed, with 16GB in each of the 2 memory
> controllers. For the first mvsas device, its PCI memory resource is
> [0x100c0000000, 0x100c003ffff], the corresponding PCI bus address range is
> [0xc0000000, 0xc003ffff] after subtracting the offset of (1ul << 40). The
> aforementioned PCI MMIO address’s low 32 bits contain the PCI bus address.
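> 
> Spelled out, for the first mvsas BAR:
> 
>   0x100c0000000 - 0x10000000000 (1ULL << 40) = 0xc0000000 (bus address)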
> 
> # cat /proc/iomem
> 00000000-3fbffffff : System RAM
>   00000000-007eeb1f : Kernel code
>   00860000-00af6e4b : Kernel data
> 4000000000-43ffffffff : System RAM
> 100c0000000-100c003ffff : mvsas
> 100c0040000-100c005ffff : mvsas
> 100c0200000-100c0203fff : sky2
> 100c0300000-100c0303fff : sata_sil24
> 100c0304000-100c030407f : sata_sil24
> 100c0400000-100c0403fff : sky2
> 
> Note that in the above example, the 2 mvsas devices are in a separate PCI
> domain from the other 4 devices.

It sounds like you're describing something like this:

  host bridge 0
    resource [mem 0x100_c0000000-0x100_c00fffff] (offset 0x100_00000000)
    bus addr [mem 0xc0000000-0xc00fffff]
  host bridge 2
    resource [mem 0x100_c0200000-0x100_c02fffff] (offset 0x100_00000000)
    bus addr [mem 0xc0200000-0xc02fffff]
  host bridge 3
    resource [mem 0x100_c0300000-0x100_c03fffff] (offset 0x100_00000000)
    bus addr [mem 0xc0300000-0xc03fffff]

If PCI bus addresses are simply the low 32 bits of the MMIO address,
there's nothing in the PCI core that should prevent you from giving a
full 4GB of bus address space to each bridge, e.g.:

  host bridge 0
    resource [mem 0x100_00000000-0x100_ffffffff] (offset 0x100_00000000)
    bus addr [mem 0x00000000-0xffffffff]
  host bridge 2
    resource [mem 0x102_00000000-0x102_ffffffff] (offset 0x102_00000000)
    bus addr [mem 0x00000000-0xffffffff]
  host bridge 3
    resource [mem 0x103_00000000-0x103_ffffffff] (offset 0x103_00000000)
    bus addr [mem 0x00000000-0xffffffff]
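
In code, I'd expect the per-bridge setup to look roughly like this (a
sketch; "mem_space" is a field I'm inventing for illustration):

  struct resource *res = &controller->mem_space;  /* hypothetical */

  res->name  = "PCI mem";
  res->start = (0x100ULL + i) << 32;  /* 0x100_00000000 for bridge 0 */
  res->end   = res->start + 0xffffffffULL;
  res->flags = IORESOURCE_MEM;

  /* bus address = resource address - offset, so each bus starts at 0 */
  pci_add_resource_offset(&resources, res, res->start);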

> > The comments at TILE_PCI_MEM_MAP_BASE_OFFSET suggest that you have two
> > MMIO regions (one for bus addresses <4GB), so there should be two
> > resources on the list here.
> 
> There is a single MMIO region, defined by the corresponding resource
> pci_iomem_resource. The TILE_PCI_MEM_MAP_BASE_OFFSET is used in the context
> of inbound access only, i.e. for DMA access. Yes, there are two inbound
> windows. The first is [1ULL << CHIP_PA_WIDTH(), 1ULL << (CHIP_PA_WIDTH() + 1)],
> used by devices that can generate 64-bit DMA addresses. The HW IOMMU is
> used to derive the real RAM address by subtracting 1ULL << CHIP_PA_WIDTH()
> from the DMA address. The second inbound window is [0, 3GB] with direct
> mapping, used by 32-bit devices, where 3GB = 4GB - MMIO_region.
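> 
> A sketch of the inbound mapping (illustrative, not the literal code):
> 
>   /* 64-bit-capable device: the DMA address is the RAM address lifted
>    * above the PA space; the TRIO IOMMU subtracts it back out. */
>   dma_addr = phys_addr + (1ULL << CHIP_PA_WIDTH());
> 
>   /* 32-bit device: direct-mapped window [0, 3GB]; swiotlb bounces any
>    * buffer whose physical address lies outside that window. */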

OK.  I'm not concerned with any inbound address issues; that's up to
the DMA API and the IOMMU.

> > The list should also include a bus number resource describing the bus
> > numbers claimed by the host bridge.  Since you don't have that, we'll
> > default to [bus 00-ff], but that's wrong if you have more than one
> > host bridge.
> 
> Fixed.
> 
> > In fact, since it appears that you *do* have multiple host bridges,
> > the "resources" list should be constructed so it contains the bus
> > number and MMIO apertures for each bridge, which should be
> > non-overlapping.
> 
> We use the same pci_iomem_resource for different domains or host bridges,
> but the MMIO apertures for each bridge do not overlap because
> non-overlapping resource ranges are allocated for each domain.

You should not use the same pci_iomem_resource for different host bridges
because that tells the PCI core that everything in pci_iomem_resource is
available for devices under every host bridge, which I doubt is the case.

The fact that your firmware assigns non-overlapping resources is good and
works now, but if the kernel ever needs to allocate resources itself, the
only way to do it correctly is to know what the actual apertures are
for each host bridge.  Eventually, I think the host bridges will also
show up in /proc/iomem, which won't work if their apertures overlap.
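
On the bus number point above, each bridge would similarly contribute
its own range, e.g. (a sketch; "busn_space" is a hypothetical field and
this assumes the bus number resource support that's going into the core):

  controller->busn_space = (struct resource) {
          .name  = "PCI busn",
          .start = controller->first_busno,
          .end   = 0xff,  /* or whatever range the bridge is allotted */
          .flags = IORESOURCE_BUS,
  };
  pci_add_resource(&resources, &controller->busn_space);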

Thanks for the detailed response.  I hope I understood it :)

> >>  void __devinit pcibios_fixup_bus(struct pci_bus *bus)
> >>  {
> >> -       /* Nothing needs to be done. */
> >> +       struct pci_dev *dev = bus->self;
> >> +
> >> +       if (!dev) {
> >> +               /* This is the root bus. */
> >> +               bus->resource[0] = &pci_ioport_resource;
> >> +               bus->resource[1] = &pci_iomem_resource;
> >> +       }
> > Please don't add this.  I'm in the process of removing
> > pcibios_fixup_bus() altogether.  Instead, you should put
> > pci_iomem_resource on a resources list and use pci_scan_root_bus().
> 
> I removed the contents of pcibios_fixup_bus(), but am leaving the no-op
> function in for now, until after the 3.6 merge.
> 
> >>  /*
> >> - * We reserve all resources above 4GB so that PCI won't try to put
> >> + * On Pro, we reserve all resources above 4GB so that PCI won't try to put
> >>  * mappings above 4GB; the standard allows that for some devices but
> >>  * the probing code trunates values to 32 bits.
> > I think this comment about probing code truncating values is out of
> > date.  Or if it's not, please point me to it so we can fix it :)
> 
> Yes, it's out of date; fixed.
> 
> >> @@ -1588,7 +1585,7 @@ static int __init request_standard_resources(void)
> >>        enum { CODE_DELTA = MEM_SV_INTRPT - PAGE_OFFSET };
> >>
> >>        iomem_resource.end = -1LL;
> > This patch isn't touching iomem_resource, but iomem_resource.end
> > *should* be set to the highest physical address your CPU can generate,
> > which is probably smaller than this.
> 
> This is not necessarily true. It is true on x86, where the PA space is shared
> by the RAM and the PCI. On TILE-Gx, iomem_resource covers all resources of
> type IORESOURCE_MEM, which include the RAM resource and the PCI resource.
> On the other hand, setting it here is not necessary because it is set to -1
> in iomem_resource’s definition in kernel/resource.c.
> 
> The change follows.
> 
> commit d52776fade4dadf0b034d101f0cd4ce4f8d2f48f
> Author: Chris Metcalf <cmetcalf@xxxxxxxxxx>
> Date:   Sun Jul 1 14:42:49 2012 -0400
> 
>     tile: updates to pci root complex from community feedback
> 
> diff --git a/arch/tile/include/asm/pci.h b/arch/tile/include/asm/pci.h
> index 553b7ff..93a1f14 100644
> --- a/arch/tile/include/asm/pci.h
> +++ b/arch/tile/include/asm/pci.h
> @@ -161,6 +161,7 @@ struct pci_controller {
> 
>         uint64_t mem_offset;    /* cpu->bus memory mapping offset. */
> 
> +       int first_busno;
>         int last_busno;
> 
>         struct pci_ops *ops;
> @@ -179,14 +180,6 @@ extern gxio_trio_context_t trio_contexts[TILEGX_NUM_TRIO];
> 
>  extern void pci_iounmap(struct pci_dev *dev, void __iomem *);
> 
> -extern void
> -pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
> -                       struct resource *res);
> -
> -extern void
> -pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
> -                       struct pci_bus_region *region);
> -
>  /*
>   * The PCI address space does not equal the physical memory address
>   * space (we have an IOMMU). The IDE and SCSI device layers use this
> diff --git a/arch/tile/kernel/pci_gx.c b/arch/tile/kernel/pci_gx.c
> index 27f7ab0..56a3c97 100644
> --- a/arch/tile/kernel/pci_gx.c
> +++ b/arch/tile/kernel/pci_gx.c
> @@ -96,14 +96,6 @@ static struct pci_ops tile_cfg_ops;
>  /* Mask of CPUs that should receive PCIe interrupts. */
>  static struct cpumask intr_cpus_map;
> 
> -/* PCI I/O space support is not implemented. */
> -static struct resource pci_ioport_resource = {
> -       .name   = "PCI IO",
> -       .start  = 0,
> -       .end    = 0,
> -       .flags  = IORESOURCE_IO,
> -};
> -
>  static struct resource pci_iomem_resource = {
>         .name   = "PCI mem",
>         .start  = TILE_PCI_MEM_START,
> @@ -588,6 +580,7 @@ int __init pcibios_init(void)
>  {
>         resource_size_t offset;
>         LIST_HEAD(resources);
> +       int next_busno;
>         int i;
> 
>         tile_pci_init();
> @@ -628,7 +621,7 @@ int __init pcibios_init(void)
>         msleep(250);
> 
>         /* Scan all of the recorded PCI controllers.  */
> -       for (i = 0; i < num_rc_controllers; i++) {
> +       for (next_busno = 0, i = 0; i < num_rc_controllers; i++) {
>                 struct pci_controller *controller = &pci_controllers[i];
>                 gxio_trio_context_t *trio_context = controller->trio;
>                 TRIO_PCIE_INTFC_PORT_CONFIG_t port_config;
> @@ -843,13 +836,14 @@ int __init pcibios_init(void)
>                  * The memory range for the PCI root bus should not overlap
>                  * with the physical RAM
>                  */
> -               pci_add_resource_offset(&resources, &iomem_resource,
> +               pci_add_resource_offset(&resources, &pci_iomem_resource,
>                                         1ULL << CHIP_PA_WIDTH());
> 
> -               bus = pci_scan_root_bus(NULL, 0, controller->ops,
> +               controller->first_busno = next_busno;
> +               bus = pci_scan_root_bus(NULL, next_busno, controller->ops,
>                                         controller, &resources);
>                 controller->root_bus = bus;
> -               controller->last_busno = bus->subordinate;
> +               next_busno = bus->subordinate + 1;
> 
>         }
> 
> @@ -1011,20 +1005,9 @@ alloc_mem_map_failed:
>  }
>  subsys_initcall(pcibios_init);
> 
> -/*
> - * PCI scan code calls the arch specific pcibios_fixup_bus() each time it scans
> - * a new bridge. Called after each bus is probed, but before its children are
> - * examined.
> - */
> +/* Note: to be deleted after Linux 3.6 merge. */
>  void __devinit pcibios_fixup_bus(struct pci_bus *bus)
>  {
> -       struct pci_dev *dev = bus->self;
> -
> -       if (!dev) {
> -               /* This is the root bus. */
> -               bus->resource[0] = &pci_ioport_resource;
> -               bus->resource[1] = &pci_iomem_resource;
> -       }
>  }
> 
>  /*
> @@ -1172,11 +1155,11 @@ static int __devinit tile_cfg_read(struct pci_bus *bus,
>         void *mmio_addr;
> 
>         /*
> -        * Map all accesses to the local device (bus == 0) into the
> +        * Map all accesses to the local device on root bus into the
>          * MMIO space of the MAC. Accesses to the downstream devices
>          * go to the PIO space.
>          */
> -       if (busnum == 0) {
> +       if (pci_is_root_bus(bus)) {
>                 if (device == 0) {
>                         /*
>                          * This is the internal downstream P2P bridge,
> @@ -1205,11 +1188,11 @@ static int __devinit tile_cfg_read(struct pci_bus *bus,
>         }
> 
>         /*
> -        * Accesses to the directly attached device (bus == 1) have to be
> +        * Accesses to the directly attached device have to be
>          * sent as type-0 configs.
>          */
> 
> -       if (busnum == 1) {
> +       if (busnum == (controller->first_busno + 1)) {
>                 /*
>                  * There is only one device off of our built-in P2P bridge.
>                  */
> @@ -1303,11 +1286,11 @@ static int __devinit tile_cfg_write(struct pci_bus *bus,
>         u8 val_8 = (u8)val;
> 
>         /*
> -        * Map all accesses to the local device (bus == 0) into the
> +        * Map all accesses to the local device on root bus into the
>          * MMIO space of the MAC. Accesses to the downstream devices
>          * go to the PIO space.
>          */
> -       if (busnum == 0) {
> +       if (pci_is_root_bus(bus)) {
>                 if (device == 0) {
>                         /*
>                          * This is the internal downstream P2P bridge,
> @@ -1336,11 +1319,11 @@ static int __devinit tile_cfg_write(struct pci_bus *bus,
>         }
> 
>         /*
> -        * Accesses to the directly attached device (bus == 1) have to be
> +        * Accesses to the directly attached device have to be
>          * sent as type-0 configs.
>          */
> 
> -       if (busnum == 1) {
> +       if (busnum == (controller->first_busno + 1)) {
>                 /*
>                  * There is only one device off of our built-in P2P bridge.
>                  */
> diff --git a/arch/tile/kernel/setup.c b/arch/tile/kernel/setup.c
> index 2b8b689..ea930ba 100644
> --- a/arch/tile/kernel/setup.c
> +++ b/arch/tile/kernel/setup.c
> @@ -1536,8 +1536,7 @@ static struct resource code_resource = {
> 
>  /*
>   * On Pro, we reserve all resources above 4GB so that PCI won't try to put
> - * mappings above 4GB; the standard allows that for some devices but
> - * the probing code trunates values to 32 bits.
> + * mappings above 4GB.
>   */
>  #if defined(CONFIG_PCI) && !defined(__tilegx__)
>  static struct resource* __init
> 
> -- 
> Chris Metcalf, Tilera Corp.
> http://www.tilera.com
> 

