Re: [PATCH] net: add Documentation/networking/scaling.txt

Randy Dunlap <rdunlap@xxxxxxxxxxxx> · Mon, 1 Aug 2011 11:41:42 -0700

On Sun, 31 Jul 2011 23:56:26 -0700 (PDT) Tom Herbert wrote:

> Describes RSS, RPS, RFS, accelerated RFS, and XPS.
> 
> Signed-off-by: Tom Herbert <therbert@xxxxxxxxxx>
> ---
>  Documentation/networking/scaling.txt |  346 ++++++++++++++++++++++++++++++++++
>  1 files changed, 346 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/networking/scaling.txt
> 
> diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
> new file mode 100644
> index 0000000..aa51f0f
> --- /dev/null
> +++ b/Documentation/networking/scaling.txt
> @@ -0,0 +1,346 @@
> +Scaling in the Linux Networking Stack
> +
> +
> +Introduction
> +============ 
> +
> +This document describes a set of complementary techniques in the Linux
> +networking stack to increase parallelism and improve performance (in
> +throughput, latency, CPU utilization, etc.) for multi-processor systems.
> +
> +The following technologies are described:
> +
> +  RSS: Receive Side Scaling
> +  RPS: Receive Packet Steering
> +  RFS: Receive Flow Steering
> +  Accelerated Receive Flow Steering
> +  XPS: Transmit Packet Steering
> +
> +
> +RSS: Receive Side Scaling
> +=========================
> +
> +Contemporary NICs support multiple receive queues (multi-queue), which
> +can be used to distribute packets amongst CPUs for processing. The NIC
> +distributes packets by applying a filter to each packet to assign it to
> +one of a small number of logical flows.  Packets for each flow are
> +steered to a separate receive queue, which in turn can be processed by
> +separate CPUs.  This mechanism is generally known as “Receive-side
> +Scaling” (RSS).
> +
> +The filter used in RSS is typically a hash function over the network or
> +transport layer headers-- for example, a 4-tuple hash over IP addresses
> +and TCP ports of a packet. The most common hardware implementation of
> +RSS uses a 128 entry indirection table where each entry stores a queue

              128-entry

> +number. The receive queue for a packet is determined by masking out the
> +low order seven bits of the computed hash for the packet (usually a
> +Toeplitz hash), taking this number as a key into the indirection table
> +and reading the corresponding value.
> +
> +Some advanced NICs allow steering packets to queues based on
> +programmable filters. For example, webserver bound TCP port 80 packets
> +can be directed to their own receive queue. Such “n-tuple” filters can
> +be configured from ethtool (--config-ntuple).
> +
> +== RSS Configuration
> +
> +The driver for a multi-queue capable NIC typically provides a module
> +parameter specifying the number of hardware queues to configure. In the
> +bnx2x driver, for instance, this parameter is called num_queues. A
> +typical RSS configuration would be to have one receive queue for each
> +CPU if the device supports enough queues, or otherwise at least one for
> +each cache domain at a particular cache level (L1, L2, etc.).
> +
> +The indirection table of an RSS device, which resolves a queue by masked
> +hash, is usually programmed by the driver at initialization.  The
> +default mapping is to distribute the queues evenly in the table, but the
> +indirection table can be retrieved and modified at runtime using ethtool
> +commands (--show-rxfh-indir and --set-rxfh-indir).  Modifying the
> +indirection table could be done to to give different queues different

                                   ^^drop one "to"

> +relative weights. 

Drop trailing whitespace above and anywhere else that it's found. (5 places)

I thought (long ago :) that multiple RX queues were for prioritizing traffic,
but there is nothing here about using multi-queues for priorities.
Is that (no longer) done?

> +
> +== RSS IRQ Configuration
> +
> +Each receive queue has a separate IRQ associated with it. The NIC
> +triggers this to notify a CPU when new packets arrive on the given
> +queue. The signaling path for PCIe devices uses message signaled
> +interrupts (MSI-X), that can route each interrupt to a particular CPU.
> +The active mapping of queues to IRQs can be determined from
> +/proc/interrupts. By default, all IRQs are routed to CPU0.  Because a
> +non-negligible part of packet processing takes place in receive
> +interrupt handling, it is advantageous to spread receive interrupts
> +between CPUs. To manually adjust the IRQ affinity of each interrupt see
> +Documentation/IRQ-affinity. On some systems, the irqbalance daemon is
> +running and will try to dynamically optimize this setting.

or (avoid a split infinitive):  will try to optimize this setting dynamically.

> +
> +
> +RPS: Receive Packet Steering
> +============================
> +
> +Receive Packet Steering (RPS) is logically a software implementation of
> +RSS.  Being in software, it is necessarily called later in the datapath.
> +Whereas RSS selects the queue and hence CPU that will run the hardware
> +interrupt handler, RPS selects the CPU to perform protocol processing
> +above the interrupt handler.  This is accomplished by placing the packet
> +on the desired CPU’s backlog queue and waking up the CPU for processing.
> +RPS has some advantages over RSS: 1) it can be used with any NIC, 2)
> +software filters can easily be added to handle new protocols, 3) it does
> +not increase hardware device interrupt rate (but does use IPIs).
> +
> +RPS is called during bottom half of the receive interrupt handler, when
> +a driver sends a packet up the network stack with netif_rx() or
> +netif_receive_skb(). These call the get_rps_cpu() function, which
> +selects the queue that should process a packet.
> +
> +The first step in determining the target CPU for RPS is to calculate a
> +flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
> +depending on the protocol). This serves as a consistent hash of the
> +associated flow of the packet. The hash is either provided by hardware
> +or will be computed in the stack. Capable hardware can pass the hash in
> +the receive descriptor for the packet, this would usually be the same

                                  packet;

> +hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
> +skb->rx_hash and can be used elsewhere in the stack as a hash of the
> +packet’s flow.
> +
> +Each receive hardware qeueue has associated list of CPUs which can

                                has an associated list (?)

> +process packets received on the queue for RPS.  For each received
> +packet, an index into the list is computed from the flow hash modulo the
> +size of the list.  The indexed CPU is the target for processing the
> +packet, and the packet is queued to the tail of that CPU’s backlog
> +queue. At the end of the bottom half routine, inter-processor interrupts
> +(IPIs) are sent to any CPUs for which packets have been queued to their
> +backlog queue. The IPI wakes backlog processing on the remote CPU, and
> +any queued packets are then processed up the networking stack. Note that
> +the list of CPUs can be configured separately for each hardware receive
> +queue.
> +
> +== RPS Configuration
> +
> +RPS requires a kernel compiled with the CONFIG_RPS flag (on by default

s/flag/kconfig symbol/

> +for smp). Even when compiled in, it is disabled without any

   for SMP).

> +configuration. The list of CPUs to which RPS may forward traffic can be
> +configured for each receive queue using the sysfs file entry:
> +
> + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
> +
> +This file implements a bitmap of CPUs. RPS is disabled when it is zero
> +(the default), in which case packets are processed on the interrupting
> +CPU.  IRQ-affinity.txt explains how CPUs are assigned to the bitmap.
> +
> +For a single queue device, a typical RPS configuration would be to set
> +the rps_cpus to the CPUs in the same cache domain of the interrupting
> +CPU for a queue. If NUMA locality is not an issue, this could also be
> +all CPUs in the system. At high interrupt rate, it might wise to exclude

                                                   it might be wise

> +the interrupting CPU from the map since that already performs much work.
> +
> +For a multi-queue system, if RSS is configured so that a receive queue
> +is mapped to each CPU, then RPS is probably redundant and unnecessary.
> +If there are fewer queues than CPUs, then RPS might be beneficial if the
> +rps_cpus for each queue are the ones that share the same cache domain as
> +the interrupting CPU for the queue.
> +
> +RFS: Receive Flow Steering
> +==========================
> +
> +While RPS steers packet solely based on hash, and thus generally

             steers packets

> +provides good load distribution, it does not take into account
> +application locality. This is accomplished by Receive Flow Steering
> +(RFS). The goal of RFS is to increase datacache hitrate by steering
> +kernel processing of packets to the CPU where the application thread
> +consuming the packet is running. RFS relies on the same RPS mechanisms
> +to enqueue packets onto the backlog of another CPU and to wake that CPU.
> +
> +In RFS, packets are not forwarded directly by the value of their hash,
> +but the hash is used as index into a flow lookup table. This table maps
> +flows to the CPUs where those flows are being processed. The flow hash
> +(see RPS section above) is used to calculate the index into this table.
> +The CPU recorded in each entry is the one which last processed the flow,
> +and if there is not a valid CPU for an entry, then packets mapped to
> +that entry are steered using plain RPS.
> +
> +To avoid out of order packets (ie. when scheduler moves a thread with

                                 (i.e., when the scheduler moves a thread that

> +outstanding receive packets on) there are two levels of flow tables used

has outstanding receive packets),

> +by RFS: rps_sock_flow_table and rps_dev_flow_table.
> +
> +rps_sock_table is a global flow table. Each table value is a CPU index
> +and is populated by recvmsg and sendmsg (specifically, inet_recvmsg(),
> +inet_sendmsg(), inet_sendpage() and tcp_splice_read()). This table
> +contains the *desired* CPUs for flows.
> +
> +rps_dev_flow_table is specific to each hardware receive queue of each
> +device.  Each table value stores a CPU index and a counter. The CPU
> +index represents the *current* CPU that is assigned to processing the
> +matching flows.
> +
> +The counter records the length of this CPU's backlog when a packet in
> +this flow was last enqueued.  Each backlog queue has a head counter that
> +is incremented on dequeue. A tail counter is computed as head counter +
> +queue length. In other words, the counter in rps_dev_flow_table[i]
> +records the last element in flow i that has been enqueued onto the
> +currently designated CPU for flow i (of course, entry i is actually
> +selected by hash and multiple flows may hash to the same entry i). 
> +
> +And now the trick for avoiding out of order packets: when selecting the
> +CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
> +and the rps_dev_flow table of the queue that the packet was received on
> +are compared.  If the desired CPU for the flow (found in the
> +rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
> +table), the packet is enqueud onto that CPU’s backlog. If they differ,

                         enqueued

> +the current cpu is updated to match the desired CPU if one of the

s/cpu/CPU/ (globally as needed)

> +following is true:
> +
> +- The current CPU's queue head counter >= the recorded tail counter
> +  value in rps_dev_flow[i]
> +- The current CPU is unset (equal to NR_CPUS)
> +- The current CPU is offline
> +
> +After this check, the packet is sent to the (possibly updated) current
> +CPU.  These rules aim to ensure that a flow only moves to a new CPU when
> +there are no packets outstanding on the old CPU, as the outstanding
> +packets could arrive later than those about to be processed on the new
> +CPU.
> +
> +== RFS Configuration
> +
> +RFS is only available if the kernel flag CONFIG_RFS is enabled (on by

s/flag/kconfig symbol/

> +default for smp). The functionality is disabled without any

s/smp/SMP/

> +configuration. The number of entries in the global flow table is set
> +through:
> +
> + /proc/sys/net/core/rps_sock_flow_entries
> +
> +The number of entries in the per queue flow table are set through:

                                per-queue

> +
> + /sys/class/net/<dev>/queues/tx-<n>/rps_flow_cnt
> +
> +Both of these need to be set before RFS is enabled for a receive queue.
> +Values for both of these are rounded up to the nearest power of two. The
> +suggested flow count depends on the expected number active connections

                                                number of

> +at any given time, which may be significantly less than the number of
> +open connections. We have found that a value of 32768 for
> +rps_sock_flow_entries works fairly well on a moderately loaded server.
> +
> +For a single queue device, the rps_flow_cnt value for the single queue
> +would normally be configured to the same value as rps_sock_flow_entries.
> +For a multi-queue device, the rps_flow_cnt for each queue might be
> +configured as rps_sock_flow_entries / N, where N is the number of
> +queues. So for instance, if rps_flow_entries is set to 32768 and there
> +are 16 configured receive queues, rps_flow_cnt for each queue might be
> +configured as 2048.
> +
> +
> +Accelerated RFS
> +===============
> +
> +Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated
> +load balancing mechanism that uses soft state to steer flows based on
> +where the thread consuming the packets of each flow is running.
> +Accelerated RFS should perform better than RFS since packets are sent
> +directly to a CPU local to the thread consuming the data. The target CPU
> +will either be the same CPU where the application runs, or at least a
> +CPU which is local to the application thread’s CPU in the cache
> +hierarchy. 
> +
> +To enable accelerated RFS, the networking stack calls the
> +ndo_rx_flow_steer driver function to communicate the desired hardware
> +queue for packets matching a particular flow. The network stack
> +automatically calls this function every time a flow entry in
> +rps_dev_flow_table is updated. The driver in turn uses a device specific

                                                            device-specific

> +method to program the NIC to steer the packets.
> +
> +The hardware queue for a flow is derived from the CPU recorded in
> +rps_dev_flow_table. The stack consults a CPU to hardware queue map which

                                            CPU-to-hardware-queue map

> +is maintained by the NIC driver. This is an autogenerated reverse map of
> +the IRQ affinity table shown by /proc/interrupts. Drivers can use
> +functions in the cpu_rmap (“cpu affinitiy reverse map”) kernel library
> +to populate the map. For each CPU, the corresponding queue in the map is
> +set to be one whose processing CPU is closest in cache locality.
> +
> +== Accelerated RFS Configuration
> +
> +Accelerated RFS is only available if the kernel is compiled with
> +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
> +It also requires that ntuple filtering is enabled via ethtool. The map
> +of CPU to queues is automatically deduced from the IRQ affinities
> +configured for each receive queue by the driver, so no additional
> +configuration should be necessary.
> +
> +XPS: Transmit Packet Steering
> +=============================
> +
> +Transmit Packet Steering is a mechanism for intelligently selecting
> +which transmit queue to use when transmitting a packet on a multi-queue
> +device. To accomplish this, a mapping from CPU to hardware queue(s) is
> +recorded. The goal of this mapping is usually to assign queues
> +exclusively to a subset of CPUs, where the transmit completions for
> +these queues are processed on a CPU within this set. This choice
> +provides two benefits. First, contention on the device queue lock is
> +significantly reduced since fewer CPUs contend for the same queue
> +(contention can be eliminated completely if each CPU has its own
> +transmit queue).  Secondly, cache miss rate on transmit completion is
> +reduced, in particular for data cache lines that hold the sk_buff
> +structures.
> +
> +XPS is configured per transmit queue by setting a bitmap of CPUs that
> +may use that queue to transmit. The reverse mapping, from CPUs to
> +transmit queues, is computed and maintained for each network device.
> +When transmitting the first packet in a flow, the function
> +get_xps_queue() is called to select a queue.  This function uses the ID
> +of the running CPU as a key into the CPU to queue lookup table. If the

                                        CPU-to-queue

> +ID matches a single queue, that is used for transmission.  If multiple
> +queues match, one is selected by using the flow hash to compute an index
> +into the set.
> +
> +The queue chosen for transmitting a particular flow is saved in the
> +corresponding socket structure for the flow (e.g. a TCP connection).
> +This transmit queue is used for subsequent packets sent on the flow to
> +prevent out of order (ooo) packets. The choice also amortizes the cost
> +of calling get_xps_queues() over all packets in the connection. To avoid
> +ooo packets, the queue for a flow can subsequently only be changed if
> +skb->ooo_okay is set for a packet in the flow. This flag indicates that
> +there are no outstanding packets in the flow, so the transmit queue can
> +change without the risk of generating out of order packets. The
> +transport layer is responsible for setting ooo_okay appropriately. TCP,
> +for instance, sets the flag when all data for a connection has been
> +acknowledged.
> +
> +
> +== XPS Configuration
> +
> +XPS is only available if the kernel flag CONFIG_XPS is enabled (on by

s/flag/kconfig symbol/

> +default for smp). The functionality is disabled without any

s/smp/SMP/

> +configuration, in which case the the transmit queue for a packet is
> +selected by using a flow hash as an index into the set of all transmit
> +queues for the device. To enable XPS, the bitmap of CPUs that may use a
> +transmit queue is configured using the sysfs file entry:
> +
> +/sys/class/net/<dev>/queues/tx-<n>/xps_cpus
> +
> +XPS is disabled when it is zero (the default). IRQ-affinity.txt explains
> +how CPUs are assigned to the bitmap. 
> +
> +For a network device with a single transmission queue, XPS configuration
> +has no effect, since there is no choice in this case. In a multi-queue
> +system, XPS is usually configured so that each CPU maps onto one queue.
> +If there are as many queues as there are CPUs in the system, then each
> +queue can also map onto one CPU, resulting in exclusive pairings that
> +experience no contention. If there are fewer queues than CPUs, then the
> +best CPUs to share a given queue are probably those that share the cache
> +with the CPU that processes transmit completions for that queue
> +(transmit interrupts).
> +
> +
> +Further Information
> +===================
> +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
> +2.6.38. Original patches were submitted by Tom Herbert
> +(therbert@xxxxxxxxxx)
> +
> +
> +Accelerated RFS was introduced in 2.6.35. Original patches were
> +submitted by Ben Hutchings (bhutchings@xxxxxxxxxxxxxx)
> +
> +Authors:
> +Tom Herbert (therbert@xxxxxxxxxx)
> +Willem de Bruijn (willemb@xxxxxxxxxx)
> +
> -- 

Very nice writeup.  Thanks.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html