"Elliott, Robert (Server Storage)" <Elliott@xxxxxx> writes: > What do these commands report about the NUMA and non-uniform IO topology on the test system? This is a DELL PowerEdge R715. See chapter 7 of this document for details on how the I/O bridges are connected: http://www.dell.com/downloads/global/products/pedge/en/Poweredge-r715-technicalguide.pdf > numactl --hardware # numactl --hardware available: 4 nodes (0-3) node 0 cpus: 0 2 4 6 node 0 size: 8182 MB node 0 free: 7856 MB node 1 cpus: 8 10 12 14 node 1 size: 8192 MB node 1 free: 8008 MB node 2 cpus: 9 11 13 15 node 2 size: 8192 MB node 2 free: 7994 MB node 3 cpus: 1 3 5 7 node 3 size: 8192 MB node 3 free: 7982 MB node distances: node 0 1 2 3 0: 10 16 16 16 1: 16 10 16 16 2: 16 16 10 16 3: 16 16 16 10 > lspci -t # lspci -vt -+-[0000:20]-+-00.0 ATI Technologies Inc RD890 Northbridge only dual slot (2x8) PCI-e GFX Hydra part | +-02.0-[21]-- | +-03.0-[22]----00.0 LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] | \-0b.0-[23]--+-00.0 Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection | \-00.1 Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection \-[0000:00]-+-00.0 ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 port A) +-02.0-[01]--+-00.0 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet | \-00.1 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet +-03.0-[02]--+-00.0 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet | \-00.1 Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet +-04.0-[03-08]----00.0-[04-08]--+-00.0-[05]----00.0 LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] +-09.0-[09]-- +-12.0 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller +-12.1 ATI Technologies Inc SB7x0 USB OHCI1 Controller +-12.2 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller +-13.0 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller +-13.1 ATI Technologies Inc SB7x0 USB OHCI1 Controller +-13.2 ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller +-14.0 ATI Technologies Inc SBx00 SMBus Controller +-14.3 ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller +-14.4-[0a]----03.0 Matrox Graphics, Inc. 
> lspci -t

# lspci -vt
-+-[0000:20]-+-00.0  ATI Technologies Inc RD890 Northbridge only dual slot (2x8) PCI-e GFX Hydra part
 |           +-02.0-[21]--
 |           +-03.0-[22]----00.0  LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator]
 |           \-0b.0-[23]--+-00.0  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
 |                        \-00.1  Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection
 \-[0000:00]-+-00.0  ATI Technologies Inc RD890 PCI to PCI bridge (external gfx0 port A)
             +-02.0-[01]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             +-03.0-[02]--+-00.0  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             |            \-00.1  Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet
             +-04.0-[03-08]----00.0-[04-08]--+-00.0-[05]----00.0  LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]
             +-09.0-[09]--
             +-12.0  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
             +-12.1  ATI Technologies Inc SB7x0 USB OHCI1 Controller
             +-12.2  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
             +-13.0  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
             +-13.1  ATI Technologies Inc SB7x0 USB OHCI1 Controller
             +-13.2  ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller
             +-14.0  ATI Technologies Inc SBx00 SMBus Controller
             +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
             +-14.4-[0a]----03.0  Matrox Graphics, Inc. MGA G200eW WPCM450
             +-18.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-18.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-18.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-18.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             +-18.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control
             +-19.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-19.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-19.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-19.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             +-19.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control
             +-1a.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-1a.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-1a.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-1a.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             +-1a.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control
             +-1b.0  Advanced Micro Devices [AMD] Family 10h Processor HyperTransport Configuration
             +-1b.1  Advanced Micro Devices [AMD] Family 10h Processor Address Map
             +-1b.2  Advanced Micro Devices [AMD] Family 10h Processor DRAM Controller
             +-1b.3  Advanced Micro Devices [AMD] Family 10h Processor Miscellaneous Control
             \-1b.4  Advanced Micro Devices [AMD] Family 10h Processor Link Control

# cat /sys/bus/pci/devices/0000\:20\:03.0/0000\:22\:00.0/numa_node
3
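That is the MegaRAID function at 0000:22:00.0, so the controller sits on
node 3, consistent with the discussion quoted below.

For reference, the manual irq pinning mentioned in the re-run below was
done roughly along these lines.  This is only a sketch, not a transcript
of the actual commands: <irq> stands in for the HBA's irq number, node 3
is just the example target, and irqbalance has to be stopped or set to
one-shot (as described below) so it does not rewrite the affinity later.

# grep -i mega /proc/interrupts            (find the HBA's irq number, <irq> below)
# cat /sys/devices/system/node/node3/cpulist > /proc/irq/<irq>/smp_affinity_list
# numactl --cpunodebind=3 <benchmark>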
-Jeff

>
>
>> -----Original Message-----
>> From: Jeff Moyer [mailto:jmoyer@xxxxxxxxxx]
>> Sent: Monday, 12 November, 2012 3:27 PM
>> To: Bart Van Assche
>> Cc: Elliott, Robert (Server Storage); linux-scsi@xxxxxxxxxxxxxxx
>> Subject: Re: [patch,v2 00/10] make I/O path allocations more numa-friendly
>>
>> Bart Van Assche <bvanassche@xxxxxxx> writes:
>>
>> > On 11/09/12 21:46, Jeff Moyer wrote:
>> >>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>> >>>> It's certainly better to tie them all to one node than let them be
>> >>>> randomly scattered across nodes; your 6% observation may simply be
>> >>>> from that.
>> >>>>
>> >>>> How do you think these compare, though (for structures that are per-IO)?
>> >>>> - tying the structures to the node hosting the storage device
>> >>>> - tying the structures to the node running the application
>> >>
>> >> This is a great question, thanks for asking it!  I went ahead and
>> >> modified the megaraid_sas driver to take a module parameter that
>> >> specifies on which node to allocate the scsi_host data structure (and
>> >> all other structures on top that are tied to that).  I then booted the
>> >> system 4 times, specifying a different node each time.  Here are the
>> >> results as compared to a vanilla kernel:
>> >>
>> [snip]
>> > Which NUMA node was processing the megaraid_sas interrupts in these
>> > tests?  Was irqbalance running during these tests or were interrupts
>> > manually pinned to a specific CPU core?
>>
>> irqbalanced was indeed running, so I can't say for sure what node the
>> irq was pinned to during my tests (I didn't record that information).
>>
>> I re-ran the tests, this time turning off irqbalance (well, I set it to
>> one-shot), and pinning the irq to the node running the benchmark.
>> In this configuration, I saw no regressions in performance.
>>
>> As a reminder:
>>
>> >> The first number is the percent gain (or loss) w.r.t. the vanilla
>> >> kernel.  The second number is the standard deviation as a percent of
>> >> the bandwidth.  So, when data structures are tied to node 0, we see an
>> >> increase in performance for nodes 0-3.  However, on node 3, which is the
>> >> node the megaraid_sas controller is attached to, we see no gain in
>> >> performance, and we see an increase in the run to run variation.  The
>> >> standard deviation for the vanilla kernel was 1% across all nodes.
>>
>> Here are the updated numbers:
>>
>> data structures tied to node 0
>>
>> application tied to:
>> node 0:   0 +/-4%
>> node 1:   9 +/-1%
>> node 2:  10 +/-2%
>> node 3:   0 +/-2%
>>
>> data structures tied to node 1
>>
>> application tied to:
>> node 0:   5 +/-2%
>> node 1:   6 +/-8%
>> node 2:  10 +/-1%
>> node 3:   0 +/-3%
>>
>> data structures tied to node 2
>>
>> application tied to:
>> node 0:   6 +/-2%
>> node 1:   9 +/-2%
>> node 2:   7 +/-6%
>> node 3:   0 +/-3%
>>
>> data structures tied to node 3
>>
>> application tied to:
>> node 0:   0 +/-4%
>> node 1:  10 +/-2%
>> node 2:  11 +/-1%
>> node 3:   0 +/-5%
>>
>> Now, the above is apples to oranges, since the vanilla kernel was run
>> w/o any tuning of irqs.  So, I went ahead and booted with
>> numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.
>>
>> When we compare a vanilla kernel with and without irq binding, we get
>> this:
>>
>> node 0:  0 +/-3%
>> node 1:  9 +/-1%
>> node 2:  8 +/-3%
>> node 3:  0 +/-1%
>>
>> As you can see, binding irqs helps nodes 1 and 2 quite substantially.
>> What this boils down to, when you compare a patched kernel with the
>> vanilla kernel, where they are both tying irqs to the node hosting the
>> application, is a net gain of zero, but an increase in standard
>> deviation.
>>
>> Let me try to make that more readable.  The patch set does not appear
>> to help at all with my benchmark configuration.  ;-)  One other
>> conclusion I can draw from this data is that irqbalance could do a
>> better job.
>>
>> An interesting (to me) tidbit about this hardware is that, while it has
>> 4 numa nodes, it only has 2 sockets.  Based on the numbers above, I'd
>> guess nodes 0 and 3 are in the same socket, likewise for 1 and 2.
>>
>> Cheers,
>> Jeff