Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping

Chris Wright wrote:
* Mike Travis (travis@xxxxxxx) wrote:
Chris Wright wrote:
* Mike Travis (travis@xxxxxxx) wrote:
   When the IOMMU is being used, each request for a DMA mapping requires
   the intel_iommu code to look for some space in the DMA mapping table.
   For most drivers this occurs for each transfer.

   When there are many outstanding DMA mappings [as seems to be the case
   with the 10GigE driver], the table grows large and the search for
   space becomes increasingly time consuming.  Performance for the
   10GigE driver drops to about 10% of its capacity on a UV system
   when the CPU count is large.
That's pretty poor.  I've seen large overheads, but when that big it was
also related to issues in the 10G driver.  Do you have profile data
showing this as the hotspot?
Here's one from our internal bug report:

Here is a profile from a run with iommu=on  iommu=pt  (no forcedac)

OK, I was actually interested in the !pt case.  But this is still
useful.  The iova lookup is distinct from the identity_mapping() case.

I can get that as well, but having every device use maps causes its
own set of problems (hundreds of dma maps).  Here's a list of devices
on the system under test.  You can see that even 'minor' glitches can
get magnified when there are so many...

Blade Location    NASID  PCI Address X Display   Device
----------------------------------------------------------------------
   0 r001i01b00      0  0000:01:00.0      -   Intel 82576 Gigabit Network Connection
   .          .      .  0000:01:00.1      -   Intel 82576 Gigabit Network Connection
   .          .      .  0000:04:00.0      -   LSI SAS1064ET Fusion-MPT SAS
   .          .      .  0000:05:00.0      -   Matrox MGA G200e
   2 r001i01b02      4  0001:02:00.0      -   Mellanox MT26428 InfiniBand
   3 r001i01b03      6  0002:02:00.0      -   Mellanox MT26428 InfiniBand
   4 r001i01b04      8  0003:02:00.0      -   Mellanox MT26428 InfiniBand
  11 r001i01b11     22  0007:02:00.0      -   Mellanox MT26428 InfiniBand
  13 r001i01b13     26  0008:02:00.0      -   Mellanox MT26428 InfiniBand
  15 r001i01b15     30  0009:07:00.0   :0.0   nVidia GF100 [Tesla S2050]
   .          .      .  0009:08:00.0   :1.1   nVidia GF100 [Tesla S2050]
  18 r001i23b02     36  000b:02:00.0      -   Mellanox MT26428 InfiniBand
  20 r001i23b04     40  000c:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  000c:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  000c:04:00.0      -   Mellanox MT26428 InfiniBand
  23 r001i23b07     46  000d:07:00.0      -   nVidia GF100 [Tesla S2050]
   .          .      .  000d:08:00.0      -   nVidia GF100 [Tesla S2050]
  25 r001i23b09     50  000e:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  000e:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  000e:04:00.0      -   Mellanox MT26428 InfiniBand
  26 r001i23b10     52  000f:02:00.0      -   Mellanox MT26428 InfiniBand
  27 r001i23b11     54  0010:02:00.0      -   Mellanox MT26428 InfiniBand
  29 r001i23b13     58  0011:02:00.0      -   Mellanox MT26428 InfiniBand
  31 r001i23b15     62  0012:02:00.0      -   Mellanox MT26428 InfiniBand
  34 r002i01b02     68  0013:01:00.0      -   Mellanox MT26428 InfiniBand
  35 r002i01b03     70  0014:02:00.0      -   Mellanox MT26428 InfiniBand
  36 r002i01b04     72  0015:01:00.0      -   Mellanox MT26428 InfiniBand
  41 r002i01b09     82  0018:07:00.0      -   nVidia GF100 [Tesla S2050]
   .          .      .  0018:08:00.0      -   nVidia GF100 [Tesla S2050]
  43 r002i01b11     86  0019:01:00.0      -   Mellanox MT26428 InfiniBand
  45 r002i01b13     90  001a:01:00.0      -   Mellanox MT26428 InfiniBand
  48 r002i23b00     96  001c:07:00.0      -   nVidia GF100 [Tesla S2050]
   .          .      .  001c:08:00.0      -   nVidia GF100 [Tesla S2050]
  50 r002i23b02    100  001d:02:00.0      -   Mellanox MT26428 InfiniBand
  52 r002i23b04    104  001e:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  001e:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  001e:04:00.0      -   Mellanox MT26428 InfiniBand
  57 r002i23b09    114  0020:01:00.0      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  0020:01:00.1      -   Intel 82599EB 10-Gigabit Network Connection
   .          .      .  0020:04:00.0      -   Mellanox MT26428 InfiniBand
  58 r002i23b10    116  0021:02:00.0      -   Mellanox MT26428 InfiniBand
  59 r002i23b11    118  0022:02:00.0      -   Mellanox MT26428 InfiniBand
  61 r002i23b13    122  0023:02:00.0      -   Mellanox MT26428 InfiniBand
  63 r002i23b15    126  0024:02:00.0      -   Mellanox MT26428 InfiniBand


uv48-sys was receiving and uv-debug sending.
ksoftirqd/640 was running at approx. 100% cpu utilization.
I had pinned the nttcp process on uv48-sys to cpu 64.

# Samples: 1255641
#
# Overhead        Command  Shared Object  Symbol
# ........  .............  .............  ......
#
   50.27%  ksoftirqd/640  [kernel]       [k] _spin_lock
   27.43%  ksoftirqd/640  [kernel]       [k] iommu_no_mapping

...
     0.48%  ksoftirqd/640  [kernel]       [k] iommu_should_identity_map
     0.45%  ksoftirqd/640  [kernel]       [k] ixgbe_alloc_rx_buffers    [ixgbe]

Note, ixgbe has had rx dma mapping issues (that's why I wondered what
was causing the massive slowdown under !pt mode).

I think since this profile run, the network guys updated the ixgbe
driver with a later version.  (I don't know the outcome of that test.)


<snip>
I tracked this time down to identity_mapping() in this loop:

      list_for_each_entry(info, &si_domain->devices, link)
              if (info->dev == pdev)
                      return 1;

I didn't get the exact count, but there were approximately 11,000 PCI
devices on this system.  And this function was called for every page
request in each DMA request.

Right, so this is the list traversal (and wow, a lot of PCI devices).

Most of the PCI devices were the 45 on each of the 256 Nehalem sockets.
There's also a ton of bridges.

Did you try a smarter data structure? (While there's room for another
bit in pci_dev, the bit is more about iommu implementation details than
anything at the pci level).

Or, since the device_domain_info is cached in the archdata of the
device struct, you should be able to just reference that directly.

Didn't think it through completely, but perhaps something as simple as:

	return pdev->dev.archdata.iommu == si_domain;
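
Fleshed out a little (untested, and the exact field type plus the
dummy-value guard below are assumptions about how archdata.iommu gets
populated), identity_mapping() might become roughly:

	static int identity_mapping(struct pci_dev *pdev)
	{
		struct device_domain_info *info;

		if (likely(!iommu_identity_mapping))
			return 0;

		/* archdata.iommu caches the device_domain_info set up when
		 * the device was attached to a domain, so the si_domain
		 * membership test becomes a pointer compare instead of a
		 * walk over si_domain->devices. */
		info = pdev->dev.archdata.iommu;
		if (info && info != DUMMY_DEVICE_DOMAIN_INFO)
			return info->domain == si_domain;

		return 0;
	}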

I can try this, thanks!


thanks,
-chris

