Bart Van Assche <bvanassche@xxxxxxx> writes:

> On 11/09/12 21:46, Jeff Moyer wrote:
>>> On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
>>>> It's certainly better to tie them all to one node than let them be
>>>> randomly scattered across nodes; your 6% observation may simply be
>>>> from that.
>>>>
>>>> How do you think these compare, though (for structures that are per-IO)?
>>>> - tying the structures to the node hosting the storage device
>>>> - tying the structures to the node running the application
>>
>> This is a great question, thanks for asking it!  I went ahead and
>> modified the megaraid_sas driver to take a module parameter that
>> specifies on which node to allocate the scsi_host data structure (and
>> all other structures on top that are tied to that).  I then booted the
>> system 4 times, specifying a different node each time.  Here are the
>> results as compared to a vanilla kernel:
>> [snip]
>
> Which NUMA node was processing the megaraid_sas interrupts in these
> tests?  Was irqbalance running during these tests or were interrupts
> manually pinned to a specific CPU core?

irqbalance was indeed running, so I can't say for sure which node was
servicing the irq during those tests (I didn't record that
information).  I re-ran the tests, this time turning off irqbalance
(well, I set it to one-shot) and pinning the irq to the node running
the benchmark.  In this configuration, I saw no regressions in
performance.  As a reminder:

>> The first number is the percent gain (or loss) w.r.t. the vanilla
>> kernel.  The second number is the standard deviation as a percent of
>> the bandwidth.  So, when data structures are tied to node 0, we see an
>> increase in performance for nodes 0-3.  However, on node 3, which is
>> the node the megaraid_sas controller is attached to, we see no gain in
>> performance, and we see an increase in the run-to-run variation.  The
>> standard deviation for the vanilla kernel was 1% across all nodes.

Here are the updated numbers:

data structures tied to node 0
  application tied to:
    node 0:  0 +/- 4%
    node 1:  9 +/- 1%
    node 2: 10 +/- 2%
    node 3:  0 +/- 2%

data structures tied to node 1
  application tied to:
    node 0:  5 +/- 2%
    node 1:  6 +/- 8%
    node 2: 10 +/- 1%
    node 3:  0 +/- 3%

data structures tied to node 2
  application tied to:
    node 0:  6 +/- 2%
    node 1:  9 +/- 2%
    node 2:  7 +/- 6%
    node 3:  0 +/- 3%

data structures tied to node 3
  application tied to:
    node 0:  0 +/- 4%
    node 1: 10 +/- 2%
    node 2: 11 +/- 1%
    node 3:  0 +/- 5%

Now, the above is apples to oranges, since the vanilla kernel was run
without any tuning of irqs.  So I went ahead and booted with
numa_node_parm=-1, which is the same as vanilla, and re-ran the tests.
Comparing a vanilla kernel with and without irq binding gives this:

    node 0: 0 +/- 3%
    node 1: 9 +/- 1%
    node 2: 8 +/- 3%
    node 3: 0 +/- 1%

As you can see, binding irqs helps nodes 1 and 2 quite substantially.
What this boils down to, when you compare the patched kernel with the
vanilla kernel, with both tying irqs to the node hosting the
application, is a net gain of zero, but an increase in standard
deviation.  Put more plainly: the patch set does not appear to help at
all in my benchmark configuration.  ;-)

One other conclusion I can draw from this data is that irqbalance
could do a better job.  An interesting (to me) tidbit about this
hardware is that, while it has 4 NUMA nodes, it only has 2 sockets.
Based on the numbers above, I'd guess nodes 0 and 3 are in the same
socket, and likewise for 1 and 2.
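In case the mechanics are useful to anyone reproducing this, the
allocation-node parameter amounts to something along these lines (a
simplified sketch, not the actual megaraid_sas change; the
alloc_adapter_ctx() helper is made up for illustration).  The default
of -1 (NUMA_NO_NODE) keeps the normal placement, which is why
numa_node_parm=-1 matches the vanilla kernel:

/*
 * Sketch only: a module parameter selecting the NUMA node on which the
 * adapter's data structures are allocated.  kzalloc_node() with
 * NUMA_NO_NODE behaves like plain kzalloc(), i.e. default placement.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/numa.h>
#include <linux/slab.h>

static int numa_node_parm = NUMA_NO_NODE;
module_param(numa_node_parm, int, 0444);
MODULE_PARM_DESC(numa_node_parm,
		 "NUMA node for adapter data structures (-1 = default placement)");

/* Allocate a per-adapter (or per-IO) structure on the requested node. */
static void *alloc_adapter_ctx(size_t size)
{
	return kzalloc_node(size, GFP_KERNEL, numa_node_parm);
}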
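And for completeness, pinning the irq to one node's CPUs is just a
matter of copying that node's cpulist from sysfs into the irq's
smp_affinity_list in procfs, once irqbalance is out of the way.  A
small sketch (the irq and node numbers are placeholders taken from the
command line):

/*
 * Copy /sys/devices/system/node/node<N>/cpulist into
 * /proc/irq/<IRQ>/smp_affinity_list.  Run after stopping (or
 * one-shotting) irqbalance so it doesn't undo the pinning.
 */
#include <stdio.h>
#include <stdlib.h>

static int pin_irq_to_node(int irq, int node)
{
	char path[128], cpulist[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/cpulist", node);
	f = fopen(path, "r");
	if (!f || !fgets(cpulist, sizeof(cpulist), f)) {
		perror(path);
		return -1;
	}
	fclose(f);

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	f = fopen(path, "w");
	if (!f || fputs(cpulist, f) == EOF) {
		perror(path);
		return -1;
	}
	return fclose(f);
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <irq> <node>\n", argv[0]);
		return 1;
	}
	return pin_irq_to_node(atoi(argv[1]), atoi(argv[2])) ? 1 : 0;
}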
Cheers,
Jeff