On Sun, 31 Jul 2011 23:56:26 -0700 (PDT) Tom Herbert wrote: > Describes RSS, RPS, RFS, accelerated RFS, and XPS. > > Signed-off-by: Tom Herbert <therbert@xxxxxxxxxx> > --- > Documentation/networking/scaling.txt | 346 ++++++++++++++++++++++++++++++++++ > 1 files changed, 346 insertions(+), 0 deletions(-) > create mode 100644 Documentation/networking/scaling.txt > > diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt > new file mode 100644 > index 0000000..aa51f0f > --- /dev/null > +++ b/Documentation/networking/scaling.txt > @@ -0,0 +1,346 @@ > +Scaling in the Linux Networking Stack > + > + > +Introduction > +============ > + > +This document describes a set of complementary techniques in the Linux > +networking stack to increase parallelism and improve performance (in > +throughput, latency, CPU utilization, etc.) for multi-processor systems. > + > +The following technologies are described: > + > + RSS: Receive Side Scaling > + RPS: Receive Packet Steering > + RFS: Receive Flow Steering > + Accelerated Receive Flow Steering > + XPS: Transmit Packet Steering > + > + > +RSS: Receive Side Scaling > +========================= > + > +Contemporary NICs support multiple receive queues (multi-queue), which > +can be used to distribute packets amongst CPUs for processing. The NIC > +distributes packets by applying a filter to each packet to assign it to > +one of a small number of logical flows. Packets for each flow are > +steered to a separate receive queue, which in turn can be processed by > +separate CPUs. This mechanism is generally known as “Receive-side > +Scaling” (RSS). > + > +The filter used in RSS is typically a hash function over the network or > +transport layer headers-- for example, a 4-tuple hash over IP addresses > +and TCP ports of a packet. The most common hardware implementation of > +RSS uses a 128 entry indirection table where each entry stores a queue 128-entry > +number. The receive queue for a packet is determined by masking out the > +low order seven bits of the computed hash for the packet (usually a > +Toeplitz hash), taking this number as a key into the indirection table > +and reading the corresponding value. > + > +Some advanced NICs allow steering packets to queues based on > +programmable filters. For example, webserver bound TCP port 80 packets > +can be directed to their own receive queue. Such “n-tuple” filters can > +be configured from ethtool (--config-ntuple). > + > +== RSS Configuration > + > +The driver for a multi-queue capable NIC typically provides a module > +parameter specifying the number of hardware queues to configure. In the > +bnx2x driver, for instance, this parameter is called num_queues. A > +typical RSS configuration would be to have one receive queue for each > +CPU if the device supports enough queues, or otherwise at least one for > +each cache domain at a particular cache level (L1, L2, etc.). > + > +The indirection table of an RSS device, which resolves a queue by masked > +hash, is usually programmed by the driver at initialization. The > +default mapping is to distribute the queues evenly in the table, but the > +indirection table can be retrieved and modified at runtime using ethtool > +commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the > +indirection table could be done to to give different queues different ^^drop one "to" > +relative weights. Drop trailing whitespace above and anywhere else that it's found. 
(5 places) I thought (long ago :) that multiple RX queues were for prioritizing traffic, but there is nothing here about using multi-queues for priorities. Is that (no longer) done? > + > +== RSS IRQ Configuration > + > +Each receive queue has a separate IRQ associated with it. The NIC > +triggers this to notify a CPU when new packets arrive on the given > +queue. The signaling path for PCIe devices uses message signaled > +interrupts (MSI-X), that can route each interrupt to a particular CPU. > +The active mapping of queues to IRQs can be determined from > +/proc/interrupts. By default, all IRQs are routed to CPU0. Because a > +non-negligible part of packet processing takes place in receive > +interrupt handling, it is advantageous to spread receive interrupts > +between CPUs. To manually adjust the IRQ affinity of each interrupt see > +Documentation/IRQ-affinity. On some systems, the irqbalance daemon is > +running and will try to dynamically optimize this setting. or (avoid a split infinitive): will try to optimize this setting dynamically. > + > + > +RPS: Receive Packet Steering > +============================ > + > +Receive Packet Steering (RPS) is logically a software implementation of > +RSS. Being in software, it is necessarily called later in the datapath. > +Whereas RSS selects the queue and hence CPU that will run the hardware > +interrupt handler, RPS selects the CPU to perform protocol processing > +above the interrupt handler. This is accomplished by placing the packet > +on the desired CPU’s backlog queue and waking up the CPU for processing. > +RPS has some advantages over RSS: 1) it can be used with any NIC, 2) > +software filters can easily be added to handle new protocols, 3) it does > +not increase hardware device interrupt rate (but does use IPIs). > + > +RPS is called during bottom half of the receive interrupt handler, when > +a driver sends a packet up the network stack with netif_rx() or > +netif_receive_skb(). These call the get_rps_cpu() function, which > +selects the queue that should process a packet. > + > +The first step in determining the target CPU for RPS is to calculate a > +flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash > +depending on the protocol). This serves as a consistent hash of the > +associated flow of the packet. The hash is either provided by hardware > +or will be computed in the stack. Capable hardware can pass the hash in > +the receive descriptor for the packet, this would usually be the same packet; > +hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in > +skb->rx_hash and can be used elsewhere in the stack as a hash of the > +packet’s flow. > + > +Each receive hardware qeueue has associated list of CPUs which can has an associated list (?) > +process packets received on the queue for RPS. For each received > +packet, an index into the list is computed from the flow hash modulo the > +size of the list. The indexed CPU is the target for processing the > +packet, and the packet is queued to the tail of that CPU’s backlog > +queue. At the end of the bottom half routine, inter-processor interrupts > +(IPIs) are sent to any CPUs for which packets have been queued to their > +backlog queue. The IPI wakes backlog processing on the remote CPU, and > +any queued packets are then processed up the networking stack. Note that > +the list of CPUs can be configured separately for each hardware receive > +queue. 
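A tiny sketch of the hash-modulo-list selection described above might help readers here. This is only an illustration, not the kernel's actual get_rps_cpu(); the function and parameter names below are made up:

    /*
     * Illustration of the RPS CPU-selection step described above:
     * index the queue's configured CPU list by the flow hash.
     * Not the kernel's get_rps_cpu(); names are hypothetical.
     */
    #include <stdint.h>

    static unsigned int pick_rps_cpu(uint32_t flow_hash,
                                     const unsigned int *rps_cpus,
                                     unsigned int n_cpus,
                                     unsigned int current_cpu)
    {
            /* Empty CPU list: RPS is off for this queue, so processing
             * stays on the CPU that took the interrupt.
             */
            if (n_cpus == 0)
                    return current_cpu;

            return rps_cpus[flow_hash % n_cpus];
    }

(The same entry point also consults the RFS tables described further down before falling back to this plain hash-based choice.)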
> + > +== RPS Configuration > + > +RPS requires a kernel compiled with the CONFIG_RPS flag (on by default s/flag/kconfig symbol/ > +for smp). Even when compiled in, it is disabled without any for SMP). > +configuration. The list of CPUs to which RPS may forward traffic can be > +configured for each receive queue using the sysfs file entry: > + > + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus > + > +This file implements a bitmap of CPUs. RPS is disabled when it is zero > +(the default), in which case packets are processed on the interrupting > +CPU. IRQ-affinity.txt explains how CPUs are assigned to the bitmap. > + > +For a single queue device, a typical RPS configuration would be to set > +the rps_cpus to the CPUs in the same cache domain of the interrupting > +CPU for a queue. If NUMA locality is not an issue, this could also be > +all CPUs in the system. At high interrupt rate, it might wise to exclude it might be wise > +the interrupting CPU from the map since that already performs much work. > + > +For a multi-queue system, if RSS is configured so that a receive queue > +is mapped to each CPU, then RPS is probably redundant and unnecessary. > +If there are fewer queues than CPUs, then RPS might be beneficial if the > +rps_cpus for each queue are the ones that share the same cache domain as > +the interrupting CPU for the queue. > + > +RFS: Receive Flow Steering > +========================== > + > +While RPS steers packet solely based on hash, and thus generally steers packets > +provides good load distribution, it does not take into account > +application locality. This is accomplished by Receive Flow Steering > +(RFS). The goal of RFS is to increase datacache hitrate by steering > +kernel processing of packets to the CPU where the application thread > +consuming the packet is running. RFS relies on the same RPS mechanisms > +to enqueue packets onto the backlog of another CPU and to wake that CPU. > + > +In RFS, packets are not forwarded directly by the value of their hash, > +but the hash is used as index into a flow lookup table. This table maps > +flows to the CPUs where those flows are being processed. The flow hash > +(see RPS section above) is used to calculate the index into this table. > +The CPU recorded in each entry is the one which last processed the flow, > +and if there is not a valid CPU for an entry, then packets mapped to > +that entry are steered using plain RPS. > + > +To avoid out of order packets (ie. when scheduler moves a thread with (i.e., when the scheduler moves a thread that > +outstanding receive packets on) there are two levels of flow tables used has outstanding receive packets), > +by RFS: rps_sock_flow_table and rps_dev_flow_table. > + > +rps_sock_table is a global flow table. Each table value is a CPU index > +and is populated by recvmsg and sendmsg (specifically, inet_recvmsg(), > +inet_sendmsg(), inet_sendpage() and tcp_splice_read()). This table > +contains the *desired* CPUs for flows. > + > +rps_dev_flow_table is specific to each hardware receive queue of each > +device. Each table value stores a CPU index and a counter. The CPU > +index represents the *current* CPU that is assigned to processing the > +matching flows. > + > +The counter records the length of this CPU's backlog when a packet in > +this flow was last enqueued. Each backlog queue has a head counter that > +is incremented on dequeue. A tail counter is computed as head counter + > +queue length. 
In other words, the counter in rps_dev_flow_table[i] > +records the last element in flow i that has been enqueued onto the > +currently designated CPU for flow i (of course, entry i is actually > +selected by hash and multiple flows may hash to the same entry i). > + > +And now the trick for avoiding out of order packets: when selecting the > +CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table > +and the rps_dev_flow table of the queue that the packet was received on > +are compared. If the desired CPU for the flow (found in the > +rps_sock_flow table) matches the current CPU (found in the rps_dev_flow > +table), the packet is enqueud onto that CPU’s backlog. If they differ, enqueued > +the current cpu is updated to match the desired CPU if one of the s/cpu/CPU/ (globally as needed) > +following is true: > + > +- The current CPU's queue head counter >= the recorded tail counter > + value in rps_dev_flow[i] > +- The current CPU is unset (equal to NR_CPUS) > +- The current CPU is offline > + > +After this check, the packet is sent to the (possibly updated) current > +CPU. These rules aim to ensure that a flow only moves to a new CPU when > +there are no packets outstanding on the old CPU, as the outstanding > +packets could arrive later than those about to be processed on the new > +CPU. > + > +== RFS Configuration > + > +RFS is only available if the kernel flag CONFIG_RFS is enabled (on by s/flag/kconfig symbol/ > +default for smp). The functionality is disabled without any s/smp/SMP/ > +configuration. The number of entries in the global flow table is set > +through: > + > + /proc/sys/net/core/rps_sock_flow_entries > + > +The number of entries in the per queue flow table are set through: per-queue > + > + /sys/class/net/<dev>/queues/tx-<n>/rps_flow_cnt > + > +Both of these need to be set before RFS is enabled for a receive queue. > +Values for both of these are rounded up to the nearest power of two. The > +suggested flow count depends on the expected number active connections number of > +at any given time, which may be significantly less than the number of > +open connections. We have found that a value of 32768 for > +rps_sock_flow_entries works fairly well on a moderately loaded server. > + > +For a single queue device, the rps_flow_cnt value for the single queue > +would normally be configured to the same value as rps_sock_flow_entries. > +For a multi-queue device, the rps_flow_cnt for each queue might be > +configured as rps_sock_flow_entries / N, where N is the number of > +queues. So for instance, if rps_flow_entries is set to 32768 and there > +are 16 configured receive queues, rps_flow_cnt for each queue might be > +configured as 2048. > + > + > +Accelerated RFS > +=============== > + > +Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated > +load balancing mechanism that uses soft state to steer flows based on > +where the thread consuming the packets of each flow is running. > +Accelerated RFS should perform better than RFS since packets are sent > +directly to a CPU local to the thread consuming the data. The target CPU > +will either be the same CPU where the application runs, or at least a > +CPU which is local to the application thread’s CPU in the cache > +hierarchy. > + > +To enable accelerated RFS, the networking stack calls the > +ndo_rx_flow_steer driver function to communicate the desired hardware > +queue for packets matching a particular flow. 
The network stack > +automatically calls this function every time a flow entry in > +rps_dev_flow_table is updated. The driver in turn uses a device specific device-specific > +method to program the NIC to steer the packets. > + > +The hardware queue for a flow is derived from the CPU recorded in > +rps_dev_flow_table. The stack consults a CPU to hardware queue map which CPU-to-hardware-queue map > +is maintained by the NIC driver. This is an autogenerated reverse map of > +the IRQ affinity table shown by /proc/interrupts. Drivers can use > +functions in the cpu_rmap (“cpu affinitiy reverse map”) kernel library > +to populate the map. For each CPU, the corresponding queue in the map is > +set to be one whose processing CPU is closest in cache locality. > + > +== Accelerated RFS Configuration > + > +Accelerated RFS is only available if the kernel is compiled with > +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. > +It also requires that ntuple filtering is enabled via ethtool. The map > +of CPU to queues is automatically deduced from the IRQ affinities > +configured for each receive queue by the driver, so no additional > +configuration should be necessary. > + > +XPS: Transmit Packet Steering > +============================= > + > +Transmit Packet Steering is a mechanism for intelligently selecting > +which transmit queue to use when transmitting a packet on a multi-queue > +device. To accomplish this, a mapping from CPU to hardware queue(s) is > +recorded. The goal of this mapping is usually to assign queues > +exclusively to a subset of CPUs, where the transmit completions for > +these queues are processed on a CPU within this set. This choice > +provides two benefits. First, contention on the device queue lock is > +significantly reduced since fewer CPUs contend for the same queue > +(contention can be eliminated completely if each CPU has its own > +transmit queue). Secondly, cache miss rate on transmit completion is > +reduced, in particular for data cache lines that hold the sk_buff > +structures. > + > +XPS is configured per transmit queue by setting a bitmap of CPUs that > +may use that queue to transmit. The reverse mapping, from CPUs to > +transmit queues, is computed and maintained for each network device. > +When transmitting the first packet in a flow, the function > +get_xps_queue() is called to select a queue. This function uses the ID > +of the running CPU as a key into the CPU to queue lookup table. If the CPU-to-queue > +ID matches a single queue, that is used for transmission. If multiple > +queues match, one is selected by using the flow hash to compute an index > +into the set. > + > +The queue chosen for transmitting a particular flow is saved in the > +corresponding socket structure for the flow (e.g. a TCP connection). > +This transmit queue is used for subsequent packets sent on the flow to > +prevent out of order (ooo) packets. The choice also amortizes the cost > +of calling get_xps_queues() over all packets in the connection. To avoid > +ooo packets, the queue for a flow can subsequently only be changed if > +skb->ooo_okay is set for a packet in the flow. This flag indicates that > +there are no outstanding packets in the flow, so the transmit queue can > +change without the risk of generating out of order packets. The > +transport layer is responsible for setting ooo_okay appropriately. TCP, > +for instance, sets the flag when all data for a connection has been > +acknowledged. 
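Likewise, the get_xps_queue() selection described above could be shown with a short sketch. Again this is illustrative only, not the kernel code; the struct and function names are made up:

    /*
     * Illustration of the XPS queue choice described above: look up the
     * running CPU in the CPU-to-queue map, use the single matching queue
     * if there is one, otherwise pick among the matches by flow hash.
     * Not the kernel's get_xps_queue(); names are hypothetical.
     */
    #include <stdint.h>

    struct xps_cpu_map {
            const unsigned int *queues;     /* queues this CPU may use */
            unsigned int n_queues;
    };

    static int pick_xps_queue(const struct xps_cpu_map *maps,
                              unsigned int cpu, uint32_t flow_hash)
    {
            const struct xps_cpu_map *m = &maps[cpu];

            if (m->n_queues == 0)
                    return -1;              /* no mapping: use the default queue selection */
            if (m->n_queues == 1)
                    return m->queues[0];    /* single match: use it */

            return m->queues[flow_hash % m->n_queues];
    }

As the text above says, the result is then cached in the flow's socket, so this lookup is only revisited when skb->ooo_okay permits a queue change.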
> + > + > +== XPS Configuration > + > +XPS is only available if the kernel flag CONFIG_XPS is enabled (on by s/flag/kconfig symbol/ > +default for smp). The functionality is disabled without any s/smp/SMP/ > +configuration, in which case the the transmit queue for a packet is > +selected by using a flow hash as an index into the set of all transmit > +queues for the device. To enable XPS, the bitmap of CPUs that may use a > +transmit queue is configured using the sysfs file entry: > + > +/sys/class/net/<dev>/queues/tx-<n>/xps_cpus > + > +XPS is disabled when it is zero (the default). IRQ-affinity.txt explains > +how CPUs are assigned to the bitmap. > + > +For a network device with a single transmission queue, XPS configuration > +has no effect, since there is no choice in this case. In a multi-queue > +system, XPS is usually configured so that each CPU maps onto one queue. > +If there are as many queues as there are CPUs in the system, then each > +queue can also map onto one CPU, resulting in exclusive pairings that > +experience no contention. If there are fewer queues than CPUs, then the > +best CPUs to share a given queue are probably those that share the cache > +with the CPU that processes transmit completions for that queue > +(transmit interrupts). > + > + > +Further Information > +=================== > +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into > +2.6.38. Original patches were submitted by Tom Herbert > +(therbert@xxxxxxxxxx) > + > + > +Accelerated RFS was introduced in 2.6.35. Original patches were > +submitted by Ben Hutchings (bhutchings@xxxxxxxxxxxxxx) > + > +Authors: > +Tom Herbert (therbert@xxxxxxxxxx) > +Willem de Bruijn (willemb@xxxxxxxxxx) > + > -- Very nice writeup. Thanks. --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code ***