On 11/09/12 21:46, Jeff Moyer wrote:
On 11/06/12 16:41, Elliott, Robert (Server Storage) wrote:
It's certainly better to tie them all to one node than to let them be
randomly scattered across nodes; your 6% observation may simply be
from that.
How do you think these compare, though (for structures that are per-IO)?
- tying the structures to the node hosting the storage device
- tying the structures to the node running the application
This is a great question, thanks for asking it! I went ahead and
modified the megaraid_sas driver to take a module parameter that
specifies on which node to allocate the scsi_host data structure (and
all other structures on top that are tied to that). I then booted the
system 4 times, specifying a different node each time. Here are the
results as compared to a vanilla kernel:
data structures tied to node 0
application tied to:
  node 0:  +6% +/-1%
  node 1:  +9% +/-2%
  node 2: +10% +/-3%
  node 3:  +0% +/-4%
The first number is the percent gain (or loss) w.r.t. the vanilla
kernel. The second number is the standard deviation as a percent of the
bandwidth. So, when data structures are tied to node 0, we see an
increase in performance for nodes 0-2. However, on node 3, which is the
node the megaraid_sas controller is attached to, we see no gain in
performance, and we see an increase in the run-to-run variation. The
standard deviation for the vanilla kernel was 1% across all nodes.
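For anyone wanting to reproduce the two columns, here is a minimal sketch of how they could be derived from raw per-run bandwidth samples. The function name and its inputs are hypothetical, and it assumes the sample standard deviation, taken relative to the mean bandwidth of the patched runs; the original measurements may have been reduced differently.

```python
from statistics import mean, stdev


def summarize(vanilla_runs, patched_runs):
    """Return (gain_pct, stddev_pct) for one application-node placement.

    vanilla_runs / patched_runs: bandwidth samples (any consistent unit),
    at least two per list, one sample per benchmark run.
    """
    base = mean(vanilla_runs)
    patched = mean(patched_runs)
    # First column: percent gain (or loss) relative to the vanilla kernel.
    gain_pct = 100.0 * (patched - base) / base
    # Second column: run-to-run standard deviation as a percent of bandwidth.
    # Assumption: sample stddev, normalized by the patched mean.
    stddev_pct = 100.0 * stdev(patched_runs) / patched
    return gain_pct, stddev_pct
```

With this reading, a gain smaller than the reported deviation (for example +0% +/-4%) is hard to tell apart from noise.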
Given that the results are mixed, depending on which node the workload
is running on, I can't really draw any conclusions from this. The node 3
number is really throwing me for a loop. If it were positive, I'd do
some handwaving about all data structures getting allocated on node 0
at boot, and the addition of getting the scsi_cmnd structure on the same
node is what resulted in the net gain.
data structures tied to node 1
application tied to:
  node 0:  +6% +/-1%
  node 1:  +0% +/-2%
  node 2:  +0% +/-6%
  node 3:  -7% +/-13%
Now this is interesting! Tying data structures to node 1 results in a
performance boost for node 0? That would seem to validate your question
of whether it just helps out to have everything come from the same node,
as opposed to allocated close to the storage controller. However, node
3 sees a decrease in performance, and a huge standard deviation. Node 2
also sees an increased standard deviation. That leaves me wondering why
node 0 didn't also experience an increase....
data structures tied to node 2
application tied to:
  node 0:  +5% +/-3%
  node 1:  +0% +/-5%
  node 2:  +0% +/-4%
  node 3:  +0% +/-5%
Here, we *mostly* just see an increase in standard deviation, with no
appreciable change in application performance.
data structures tied to node 3
application tied to:
  node 0:  +0% +/-6%
  node 1:  +6% +/-4%
  node 2:  +7% +/-4%
  node 3:  +0% +/-4%
Now, this is the case where I'd expect to see the best performance,
since the HBA is on node 3. However, that's not what we get! Instead,
we get maybe a couple percent improvement on nodes 1 and 2, and an
increased run-to-run variation for nodes 0 and 3.
Overall, I'd say that my testing is inconclusive, and I may just pull
the patch set until I can get some reasonable results.
Which NUMA node was processing the megaraid_sas interrupts in these
tests? Was irqbalance running during these tests, or were interrupts
manually pinned to a specific CPU core?
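In case it helps answer that, below is a rough sketch of one way to see which node is taking the interrupts: sum the per-CPU counts in /proc/interrupts for the controller's lines and fold them into per-node totals using the sysfs cpulist files. The helper names are made up for illustration, and the "megasas" match string is an assumption about how the driver's IRQs are labelled on your system; adjust it to whatever /proc/interrupts actually shows.

```python
import glob
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist such as '0-3,8-11' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def irq_counts_per_cpu(interrupts_text, pattern="megasas"):
    """Sum interrupt counts per CPU over /proc/interrupts lines containing
    pattern. Returns {cpu_number: count}."""
    lines = interrupts_text.splitlines()
    # Header row lists the online CPUs, e.g. "CPU0 CPU1 ...".
    cpu_ids = [int(name[3:]) for name in lines[0].split()]
    totals = {cpu: 0 for cpu in cpu_ids}
    for line in lines[1:]:
        if pattern not in line:
            continue
        # Skip the leading "NN:" label; the next fields are per-CPU counts.
        fields = line.split()[1:1 + len(cpu_ids)]
        for cpu, field in zip(cpu_ids, fields):
            if field.isdigit():
                totals[cpu] += int(field)
    return totals


def counts_per_node(totals, node_cpus):
    """Fold per-CPU totals into per-node totals given {node: set_of_cpus}."""
    return {node: sum(totals.get(cpu, 0) for cpu in cpus)
            for node, cpus in node_cpus.items()}


def read_node_cpus():
    """Read the node-to-CPU map from sysfs (Linux only)."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*/cpulist"):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as f:
            nodes[node] = parse_cpulist(f.read())
    return nodes


# Typical use on the test machine:
#   with open("/proc/interrupts") as f:
#       totals = irq_counts_per_cpu(f.read())
#   print(counts_per_node(totals, read_node_cpus()))
```

Taking a snapshot before and after a benchmark run and subtracting the two would show where the interrupts landed during that run, which matters if irqbalance is moving them around.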
Thanks,
Bart.