Folks -
Under:
ftp://ftp.netperf.org/iptable_scaling
can be found netperf results and Caliper profiles for three scenarios on a
32-core, 1.6 GHz 'Montecito' rx8640 system. An rx8640 is what HP call a "cell
based" system in that it is comprised of "cell boards" on which reside CPU and
memory resources. In this case there are four cell boards, each with four
dual-core Montecito processors and 1/4 of the overall RAM. The system was
configured with a mix of cell-local and global interleaved memory, where the
global interleave is on a cacheline (128 byte) boundary (IIRC). Total RAM in the
system is 256 GB. The cells are joined via cross-bar connections. (numactl
--hardware output is available under the URL above.)
There was an "I/O expander" connected to the system. This meant there were as
many distinct PCI-X domains as there were cells, and every cell had a "local" set
of PCI-X slots.
Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs - aka
Neterion XFrame IIs. These were then connected to an HP ProCurve 5806 switch,
which was in turn connected to three, 4P/16C, 2.3 GHz HP DL585 G5s, each of which
had a pair of HP AD386A PCIe 10Gbit Ethernet NICs (aka Chelsio T3C-based). I
think they were running RHEL 5.2. Each NIC was in either a PCI-X 2.0 266 MHz slot
(rx8640) or a PCIe 1.mumble x8 slot (DL585 G5).
The kernel is from DaveM's net-next tree circa last week, multiq enabled. The s2io
driver is Neterion's out-of-tree version 2.0.36.15914 to get multiq support. It
was loaded into the kernel via:
insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8
There were then 8 tx queues and 8 rx queues per interface in the rx8640. The
"setaffinity.txt" script was used to set the IRQ affinities to cores "closest" to
the physical NIC. In all three tests all 32 cores went to 100% utilization, at
least for all incense and porpoises (top reported some occasional idle on the
full_iptables run).
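For anyone wanting to reproduce the affinity step without reading setaffinity.txt, here is a minimal sketch of the underlying mechanism: Linux takes a hexadecimal CPU bitmask written to /proc/irq/<irq>/smp_affinity. The IRQ numbers and the IRQ-to-core mapping below are hypothetical, not the ones from the rx8640, and the sketch only prints the commands rather than writing to /proc.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers.

    Only CPUs 0..31 are handled; wider masks use comma-separated
    32-bit groups in smp_affinity and are not covered by this sketch.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0..31")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_commands(irq_to_cpu):
    """Turn a {irq: cpu} mapping into the shell commands one would run as root."""
    return [
        f"echo {affinity_mask([cpu])} > /proc/irq/{irq}/smp_affinity"
        for irq, cpu in sorted(irq_to_cpu.items())
    ]


if __name__ == "__main__":
    # Hypothetical IRQ numbers and target cores, for illustration only.
    for command in affinity_commands({70: 0, 71: 1, 72: 8}):
        print(command)
```

Which core counts as "closest" to a given NIC has to come from the platform topology (the numactl --hardware output, for instance); the sketch does not try to work that out.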
A set of 64 concurrent "burst mode" netperf omni RR tests (TCP) with a burst
mode of 17 were run (i.e. 17 "transactions" outstanding on a connection at one
time), with TCP_NODELAY set, and the results gathered along with a set of Caliper
profiles. The script used to launch these can be found in "runemomniagg2.sh.txt"
under the URL above.
I picked an "RR" test to maximize the trips up and down the stack while
minimizing the bandwidth consumed.
I picked a burst size of 16 because that was sufficient to saturate a single core
on the rx8640.
I picked 64 concurrent netperfs because I wanted to make sure I had enough
concurrent connections to get spread across all the cores/queues by the
algorithms in place.
I picked the combination of 64 and 16 rather than say 1024 and 0 (one tran at a
time) because I didn't want to run a context switching benchmark :)
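To put rough numbers on that trade-off, here is a small sketch comparing the two configurations. It assumes a burst value of N means N+1 transactions outstanding per connection, which is how I read "1024 and 0 (one tran at a time)" and the 17 outstanding mentioned earlier; the function names are just for illustration.

```python
CORES = 32  # cores in the rx8640 under test


def in_flight(connections, burst):
    """Total transactions outstanding, assuming burst N means N+1 per connection."""
    return connections * (burst + 1)


def processes_per_core(connections, cores=CORES):
    """One netperf process per connection, spread evenly over the cores."""
    return connections / cores


if __name__ == "__main__":
    for connections, burst in ((64, 16), (1024, 0)):
        print(
            f"{connections} netperfs, burst {burst}: "
            f"{in_flight(connections, burst)} transactions in flight, "
            f"{processes_per_core(connections):.0f} processes per core"
        )
```

Both setups keep roughly a thousand transactions in flight, but the second needs 32 runnable processes per core instead of 2, which is where the context switching would come from.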
The rx8640 was picked because it was available and I was confident it was not
going to have any hardware scaling issues getting in the way. I wanted to see SW
issues, not HW issues. I am ass-u-me-ing the rx8640 is a reasonable analog for
any "decent or better scaling" 32 core hardware and that while there are
ia64-specific routines present in the profiles, they are there for
platform-independent reasons.
The no_iptables/ data was run after a fresh boot, with no iptables commands run
and so no iptables related modules loaded into the kernel.
The empty_iptables/ data was run after an "iptables --list" command which loaded
one or two modules into the kernel.
The full_iptables/ data was run after an "iptables-restore" command pointed at
full_iptables/iptables.txt, which was created from what RH creates by default
when one enables the firewall via their installer, with a port range added by me
to allow pretty much anything netperf would ask. As such, while it does exercise
netfilter functionality, I cannot make any claims as to its "real world"
applicability. (While the firewall settings came from an RH setup, FWIW, the
base bits running on the rx8640 are Debian Lenny, with the net-next kernel on top.)
The "cycles" profile is a flat profile that can record hits even while
interrupts are disabled, so it shows code running in those regions. The
"scgprof" profile is an attempt to get some call graphs; it does not have
visibility into code running with interrupts disabled. The "cache" profile
looks to get some cache miss information.
So, having said all that, details can be found under the previously mentioned
URL. Some quick highlights:
no_iptables - ~22000 transactions/s/netperf. Top of the cycles profile looks like:
Function Summary
-----------------------------------------------------------------------
% Total
     IP  Cumulat       IP
Samples     % of  Samples
  (ETB)    Total    (ETB)  Function                              File
-----------------------------------------------------------------------
   5.70     5.70    37772  s2io.ko::tx_intr_handler
   5.14    10.84    34012  vmlinux::__ia64_readq
   4.88    15.72    32285  s2io.ko::s2io_msix_ring_handle
   4.63    20.34    30625  s2io.ko::rx_intr_handler
   4.60    24.94    30429  s2io.ko::s2io_xmit
   3.85    28.79    25488  s2io.ko::s2io_poll_msix
   2.87    31.65    18987  vmlinux::dev_queue_xmit
   2.51    34.16    16620  vmlinux::tcp_sendmsg
   2.51    36.67    16588  vmlinux::tcp_ack
   2.15    38.82    14221  vmlinux::__inet_lookup_established
   2.10    40.92    13937  vmlinux::ia64_spinlock_contention
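If it helps when comparing the profiles, here is a rough sketch of how the rows above could be split per module (driver versus core kernel). It assumes each data row is four whitespace-separated fields with a module::function symbol last; that matches the excerpt above but I have not checked it against the full Caliper output, and the helper names are mine.

```python
from collections import defaultdict

# The no_iptables rows quoted above: % total, cumulative %, samples, symbol.
NO_IPTABLES_TOP = """\
5.70 5.70 37772 s2io.ko::tx_intr_handler
5.14 10.84 34012 vmlinux::__ia64_readq
4.88 15.72 32285 s2io.ko::s2io_msix_ring_handle
4.63 20.34 30625 s2io.ko::rx_intr_handler
4.60 24.94 30429 s2io.ko::s2io_xmit
3.85 28.79 25488 s2io.ko::s2io_poll_msix
2.87 31.65 18987 vmlinux::dev_queue_xmit
2.51 34.16 16620 vmlinux::tcp_sendmsg
2.51 36.67 16588 vmlinux::tcp_ack
2.15 38.82 14221 vmlinux::__inet_lookup_established
2.10 40.92 13937 vmlinux::ia64_spinlock_contention
"""


def parse_rows(text):
    """Yield (percent, samples, module, function) for each data row.

    Header and separator lines are skipped because they do not have
    four fields ending in a module::function symbol.
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4 or "::" not in parts[3]:
            continue
        module, function = parts[3].split("::", 1)
        yield float(parts[0]), int(parts[2]), module, function


def totals_by_module(text):
    """Sum percent and sample counts per module over the listed rows."""
    totals = defaultdict(lambda: [0.0, 0])
    for percent, samples, module, _function in parse_rows(text):
        totals[module][0] += percent
        totals[module][1] += samples
    return {module: (round(p, 2), s) for module, (p, s) in totals.items()}


if __name__ == "__main__":
    for module, (percent, samples) in totals_by_module(NO_IPTABLES_TOP).items():
        print(f"{module}: {percent:.2f}% of total, {samples} samples")
```

These totals only cover the functions listed in the excerpt, not the whole profile.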
empty_iptables - ~12000 transactions/s/netperf. Top of the cycles profile looks
like:
Function Summary
-----------------------------------------------------------------------
% Total
     IP  Cumulat       IP
Samples     % of  Samples
  (ETB)    Total    (ETB)  Function                              File
-----------------------------------------------------------------------
  26.38    26.38   137458  vmlinux::_read_lock_bh
  10.63    37.01    55388  vmlinux::local_bh_enable_ip
   3.42    40.43    17812  s2io.ko::tx_intr_handler
   3.01    43.44    15691  ip_tables.ko::ipt_do_table
   2.90    46.34    15100  vmlinux::__ia64_readq
   2.72    49.06    14179  s2io.ko::rx_intr_handler
   2.55    51.61    13288  s2io.ko::s2io_xmit
   1.98    53.59    10329  s2io.ko::s2io_msix_ring_handle
   1.75    55.34     9104  vmlinux::dev_queue_xmit
   1.64    56.98     8546  s2io.ko::s2io_poll_msix
   1.52    58.50     7943  vmlinux::sock_wfree
   1.40    59.91     7302  vmlinux::tcp_ack
full_iptables - some test instances didn't complete, I think they got starved.
Of those which did complete, their performance ranged all the way from 330 to
3100 transactions/s/netperf. Top of the cycles profile looks like:
Function Summary
-----------------------------------------------------------------------
% Total
     IP  Cumulat       IP
Samples     % of  Samples
  (ETB)    Total    (ETB)  Function                              File
-----------------------------------------------------------------------
  64.71    64.71   582171  vmlinux::_write_lock_bh
  18.43    83.14   165822  vmlinux::ia64_spinlock_contention
   2.86    85.99    25709  nf_conntrack.ko::init_module
   2.36    88.35    21194  nf_conntrack.ko::tcp_packet
   1.78    90.13    16009  vmlinux::_spin_lock_bh
   1.20    91.33    10810  nf_conntrack.ko::nf_conntrack_in
   1.20    92.52    10755  vmlinux::nf_iterate
   1.09    93.62     9833  vmlinux::default_idle
   0.26    93.88     2331  vmlinux::__ia64_readq
   0.25    94.12     2213  vmlinux::__interrupt
   0.24    94.37     2203  s2io.ko::tx_intr_handler
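For a sense of scale, here is a back-of-the-envelope sketch relating the three per-netperf figures quoted above. The aggregate numbers assume all 64 instances ran at roughly the quoted per-netperf rate, which is an approximation on my part; no aggregate is computed for full_iptables since not every instance completed.

```python
CONCURRENT_NETPERFS = 64

# Approximate transactions/s/netperf as quoted above.
NO_IPTABLES = 22000
EMPTY_IPTABLES = 12000
FULL_IPTABLES_RANGE = (330, 3100)  # only the instances that completed


def aggregate(per_netperf, instances=CONCURRENT_NETPERFS):
    """Rough aggregate rate, assuming every instance ran at the same rate."""
    return per_netperf * instances


def relative(rate, baseline=NO_IPTABLES):
    """Rate as a fraction of the no_iptables baseline."""
    return rate / baseline


if __name__ == "__main__":
    print(f"no_iptables:    ~{aggregate(NO_IPTABLES)} transactions/s aggregate")
    print(
        f"empty_iptables: ~{aggregate(EMPTY_IPTABLES)} transactions/s aggregate, "
        f"{relative(EMPTY_IPTABLES):.0%} of baseline"
    )
    low, high = FULL_IPTABLES_RANGE
    print(
        f"full_iptables:  {relative(low):.1%} to {relative(high):.1%} "
        "of baseline per netperf"
    )
```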
Suggestions as to things to look at/with and/or patches to try are welcome. I
should have the HW available to me for at least a little while, but not indefinitely.
rick jones