Re: 32 core net-next stack/netfilter "scaling"

On Tue, 27 Jan 2009 00:10:44 +0100
Eric Dumazet <dada1@xxxxxxxxxxxxx> wrote:

> Rick Jones wrote:
> > Folks -
> > 
> > Under:
> > 
> > ftp://ftp.netperf.org/iptable_scaling
> > 
> > can be found netperf results and Caliper profiles for three scenarios on
> > a 32-core, 1.6 GHz 'Montecito' rx8640 system.  An rx8640 is what HP calls
> > a "cell based" system in that it is composed of "cell boards" on which
> > reside CPU and memory resources.  In this case there are four cell
> > boards, each with 4, dual-core Montecito processors and 1/4 of the
> > overall RAM.  The system was configured with a mix of cell-local and
> > global interleaved memory, where the global interleave is on a cacheline
> > (128 byte) boundary (IIRC).  Total RAM in the system is 256 GB.  The
> > cells are joined via cross-bar connections. (numactl --hardware output
> > is available under the URL above)
> > 
> > There was an "I/O expander" connected to the system.  This meant there
> > were as many distinct PCI-X domains as there were cells, and every cell
> > had a "local" set of PCI-X slots.
> > 
> > Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs -
> > aka Neterion XFrame IIs.  These were then connected to an HP ProCurve
> > 5806 switch, which was in turn connected to three, 4P/16C, 2.3 GHz HP
> > DL585 G5s, each of which had a pair of HP AD386A PCIe 10Gbit Ethernet
> > NICs (Aka Chelsio T3C-based).  They were running RHEL 5.2 I think.  Each
> > NIC was in either a PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble
> > x8 slot (DL585 G5)
> > 
> > The kernel is from DaveM's net-next tree circa last week, multiq enabled.
> > The s2io driver is Neterion's out-of-tree version 2.0.36.15914 to get
> > multiq support.  It was loaded into the kernel via:
> > 
> > insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8
> > 
> > There were then 8 tx queues and 8 rx queues per interface in the
> > rx8640.  The "setaffinity.txt" script was used to set the IRQ affinities
> > to cores "closest" to the physical NIC. In all three tests all 32 cores
> > went to 100% utilization. At least for all incense and porpoises. (there
> > was some occasional idle reported by top on the full_iptables run)
> > 
> > A set of 64 concurrent "burst mode" netperf omni RR tests (tcp) with a
> > burst mode of 17 were run (i.e. 17 "transactions" outstanding on a
> > connection at one time), with TCP_NODELAY set and the results gathered,
> > along with a set of Caliper profiles.  The script used to launch these
> > can be found in "runemomniagg2.sh.txt" under the URL above.
> > 
> > I picked an "RR" test to maximize the trips up and down the stack while
> > minimizing the bandwidth consumed.
> > 
> > I picked a burst size of 16 because that was sufficient to saturate a
> > single core on the rx8640.
> > 
> > I picked 64 concurrent netperfs because I wanted to make sure I had
> > enough concurrent connections to get spread across all the cores/queues
> > by the algorithms in place.
> > 
> > I picked the combination of 64 and 16 rather than say 1024 and 0 (one
> > tran at a time) because I didn't want to run a context switching
> > benchmark :)
> > 
> > The rx8640 was picked because it was available and I was confident it
> > was not going to have any hardware scaling issues getting in the way.  I
> > wanted to see SW issues, not HW issues. I am ass-u-me-ing the rx8640 is
> > a reasonable analog for any "decent or better scaling" 32 core hardware
> > and that while there are ia64-specific routines present in the profiles,
> > they are there for platform-independent reasons.
> > 
> > The no_iptables/ data was run after a fresh boot, with no iptables
> > commands run and so no iptables related modules loaded into the kernel.
> > 
> > The empty_iptables/ data was run after an "iptables --list" command
> > which loaded one or two modules into the kernel.
> > 
> > The full_iptables/ data was run after an "iptables-restore" command
> > pointed at full_iptables/iptables.txt  which was created from what RH
> > creates by default when one enables firewall via their installer, with a
> > port range added by me to allow pretty much anything netperf would ask. 
> > As such, while it does exercise netfilter functionality, I cannot make
> > any claims as to its "real world" applicability.  (while the firewall
> > settings came from an RH setup, FWIW, the base bits running on the
> > rx8640 are Debian Lenny, with the net-next kernel on top)
> > 
> > The "cycles" profile is able to grab flat profile hits while interrupts
> > are disabled so it can see stuff happening while interrupts are
> > disabled.  The "scgprof" profile is an attempt to get some call graphs -
> > it does not have visibility into code running with interrupts disabled. 
> > The "cache" profile is a profile that looks to get some cache miss
> > information.
> > 
> > So, having said all that, details can be found under the previously
> > mentioned URL.  Some quick highlights:
> > 
> > no_iptables - ~22000 transactions/s/netperf.  Top of the cycles profile
> > looks like:
> > 
> > Function Summary
> > -----------------------------------------------------------------------
> > % Total
> >      IP  Cumulat             IP
> > Samples    % of         Samples
> >  (ETB)     Total         (ETB)   Function                          File
> > -----------------------------------------------------------------------
> >    5.70     5.70         37772   s2io.ko::tx_intr_handler
> >    5.14    10.84         34012   vmlinux::__ia64_readq
> >    4.88    15.72         32285   s2io.ko::s2io_msix_ring_handle
> >    4.63    20.34         30625   s2io.ko::rx_intr_handler
> >    4.60    24.94         30429   s2io.ko::s2io_xmit
> >    3.85    28.79         25488   s2io.ko::s2io_poll_msix
> >    2.87    31.65         18987   vmlinux::dev_queue_xmit
> >    2.51    34.16         16620   vmlinux::tcp_sendmsg
> >    2.51    36.67         16588   vmlinux::tcp_ack
> >    2.15    38.82         14221   vmlinux::__inet_lookup_established
> >    2.10    40.92         13937   vmlinux::ia64_spinlock_contention
> > 
> > empty_iptables - ~12000 transactions/s/netperf.  Top of the cycles
> > profile looks like:
> > 
> > Function Summary
> > -----------------------------------------------------------------------
> > % Total
> >      IP  Cumulat             IP
> > Samples    % of         Samples
> >  (ETB)     Total         (ETB)   Function                          File
> > -----------------------------------------------------------------------
> >   26.38    26.38        137458   vmlinux::_read_lock_bh
> >   10.63    37.01         55388   vmlinux::local_bh_enable_ip
> >    3.42    40.43         17812   s2io.ko::tx_intr_handler
> >    3.01    43.44         15691   ip_tables.ko::ipt_do_table
> >    2.90    46.34         15100   vmlinux::__ia64_readq
> >    2.72    49.06         14179   s2io.ko::rx_intr_handler
> >    2.55    51.61         13288   s2io.ko::s2io_xmit
> >    1.98    53.59         10329   s2io.ko::s2io_msix_ring_handle
> >    1.75    55.34          9104   vmlinux::dev_queue_xmit
> >    1.64    56.98          8546   s2io.ko::s2io_poll_msix
> >    1.52    58.50          7943   vmlinux::sock_wfree
> >    1.40    59.91          7302   vmlinux::tcp_ack
> > 
> > full_iptables - some test instances didn't complete, I think they got
> > starved. Of those which did complete, their performance ranged all the
> > way from 330 to 3100 transactions/s/netperf.  Top of the cycles profile
> > looks like:
> > 
> > Function Summary
> > -----------------------------------------------------------------------
> > % Total
> >      IP  Cumulat             IP
> > Samples    % of         Samples
> >  (ETB)     Total         (ETB)   Function                          File
> > -----------------------------------------------------------------------
> >   64.71    64.71        582171   vmlinux::_write_lock_bh
> >   18.43    83.14        165822   vmlinux::ia64_spinlock_contention
> >    2.86    85.99         25709   nf_conntrack.ko::init_module
> >    2.36    88.35         21194   nf_conntrack.ko::tcp_packet
> >    1.78    90.13         16009   vmlinux::_spin_lock_bh
> >    1.20    91.33         10810   nf_conntrack.ko::nf_conntrack_in
> >    1.20    92.52         10755   vmlinux::nf_iterate
> >    1.09    93.62          9833   vmlinux::default_idle
> >    0.26    93.88          2331   vmlinux::__ia64_readq
> >    0.25    94.12          2213   vmlinux::__interrupt
> >    0.24    94.37          2203   s2io.ko::tx_intr_handler
> > 
> > Suggestions as to things to look at/with and/or patches to try are
> > welcome.  I should have the HW available to me for at least a little
> > while, but not indefinitely.
> > 
> > rick jones
> 
> Hi Rick, nice hardware you have :)
> 
> Stephen had a patch to nuke read_lock() from iptables, using RCU and seqlocks.
> I hit this contention point even with low-cost hardware and a quite standard application.
> 
> I pinged him a few days ago to try to finish the job with him, but it seems Stephen
> is busy at the moment.
> 
> Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t tcp_lock
> that must be write_locked() for basically every handled tcp frame...
> 
> How long is "not indefinitely"?

Hey, I just got back from Linux Conf Au, haven't had time to catch up yet.
It is on my list, after dealing with the other work related stuff.
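
To put the per-netperf figures quoted above in perspective, here is a rough
back-of-the-envelope sketch that scales them to the whole box.  It only uses
numbers from Rick's message (64 netperfs, 32 cores, ~22000 and ~12000
transactions/s/netperf, and the 330 to 3100 range for full_iptables).  The
helper names are made up for illustration, and the aggregate assumes every
netperf instance ran at roughly the quoted rate, which the full_iptables run
did not (some instances never completed), so no aggregate is computed for it.

```python
# Approximate figures quoted in the thread.
NETPERFS = 64   # concurrent netperf instances
CORES = 32      # cores in the rx8640

per_netperf_tps = {
    "no_iptables": 22000,
    "empty_iptables": 12000,
}
# full_iptables: only a range was reported; some instances did not complete.
full_iptables_range = (330, 3100)

BASELINE = per_netperf_tps["no_iptables"]


def aggregate_tps(per_instance, instances=NETPERFS):
    """Aggregate transactions/s, assuming all instances run at the same rate."""
    return per_instance * instances


def relative_to_baseline(tps, baseline=BASELINE):
    """Fraction of the no_iptables per-netperf rate."""
    return tps / baseline


for name, tps in per_netperf_tps.items():
    total = aggregate_tps(tps)
    print(f"{name}: ~{total} trans/s aggregate, "
          f"~{total / CORES:.0f} per core, "
          f"{relative_to_baseline(tps):.0%} of baseline")

low, high = full_iptables_range
print(f"full_iptables: {relative_to_baseline(low):.1%} to "
      f"{relative_to_baseline(high):.1%} of baseline per netperf")
```

On these assumptions, merely loading the iptables modules costs a bit under
half of the per-netperf rate, and the full ruleset with conntrack leaves the
surviving instances at a small fraction of it.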