Re: [PATCH v2] net: add Documentation/networking/scaling.txt

On 08/11/2011 09:31 AM, Eric Dumazet wrote:
> On Thursday 11 August 2011 at 10:26 -0400, Will de Bruijn wrote:
>
>> I'll be happy to revise it once more. This version also lacks the
>> required one-line description in Documentation/networking/00-INDEX, so
>> I will have to resubmit, either way.
>
> Well, the patch was already accepted by David in the net tree two days ago ;)

Didn't see the customary "Applied" email - mailer glitch somewhere?

Anyhow, regardless of whether or how further changes are made, here are the bits I thought might be a matter of opinion, or perhaps simply stripes on the bikeshed...

<rss>
+== Suggested Configuration
+
+RSS should be enabled when latency is a concern or whenever receive
+interrupt processing forms a bottleneck. Spreading load between CPUs
+decreases queue length. For low latency networking, the optimal setting
+is to allocate as many queues as there are CPUs in the system (or the
+NIC maximum, if lower). Because the aggregate number of interrupts grows
+with each additional queue, the most efficient high-rate configuration
+is likely the one with the smallest number of receive queues where no
+CPU that processes receive interrupts reaches 100% utilization. Per-cpu
+load can be observed using the mpstat utility.

Whether it lowers latency in the absence of an interrupt processing bottleneck depends on whether or not the application(s) receiving the data are able/allowed to run on the CPU(s) to which the IRQs of the queues are directed, right?

Also, what mpstat and its ilk show as CPUs could be HW threads - is it indeed the case that the optimum is as many queues as there are HW threads, or as many queues as there are discrete cores?
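
(As an aside, and purely as a hypothetical sketch: this is roughly how I'd count one versus the other, assuming the usual sysfs topology files are present - the paths and their interpretation are mine, not anything the document prescribes.)

#!/usr/bin/env python3
# Rough sketch: count logical CPUs vs. distinct physical cores via the
# sysfs topology files. Assumes /sys/devices/system/cpu/cpuN/topology
# exists (typical on x86 Linux); purely illustrative.
import glob, os

logical = 0
cores = set()
for cpu_dir in glob.glob("/sys/devices/system/cpu/cpu[0-9]*"):
    topo = os.path.join(cpu_dir, "topology")
    if not os.path.isdir(topo):
        continue
    logical += 1
    with open(os.path.join(topo, "physical_package_id")) as f:
        pkg = f.read().strip()
    with open(os.path.join(topo, "core_id")) as f:
        core = f.read().strip()
    cores.add((pkg, core))

print("logical CPUs (HW threads):", logical)
print("distinct physical cores:  ", len(cores))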

If I have disabled interrupt coalescing in the name of latency, does the number of queues actually affect the number of interrupts?

Certainly any CPU processing interrupts that stays below 100% utilization is less likely to be a bottleneck, but if there are algorithms/heuristics that get more efficient under load, staying below the 100% mark doesn't mean that peak efficiency has been reached. If there is something that processes more and more packets per lock grab/release, then it is actually most efficient, in terms of packets processed per unit of CPU consumption, once one gets to the ragged edge of saturation.
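
To put some admittedly made-up numbers behind that, here is the sort of back-of-the-envelope I have in mind - a fixed cost per wakeup (lock grab/release, softirq entry, whatever) amortized over however many packets get handled in that wakeup. All costs below are hypothetical:

# Toy arithmetic only, with made-up costs: a fixed overhead per wakeup
# amortized over the packets handled in that wakeup, showing why the
# per-packet CPU cost keeps falling as load (batch size) rises.
FIXED_US_PER_WAKEUP = 5.0   # hypothetical fixed cost per wakeup, usec
PER_PACKET_US = 1.0         # hypothetical per-packet cost, usec

for pkts_per_wakeup in (1, 2, 4, 8, 16, 32, 64):
    total = FIXED_US_PER_WAKEUP + PER_PACKET_US * pkts_per_wakeup
    print(f"{pkts_per_wakeup:3d} pkts/wakeup -> "
          f"{total / pkts_per_wakeup:5.2f} usec of CPU per packet")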

Is utilization of the rx ring associated with the queue the more accurate, albeit unavailable, measure of saturation?
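
For what it's worth, alongside mpstat I'd probably also reach for something like the following rough sketch to see which CPUs are actually taking the NET_RX hits - assuming the usual /proc/softirqs layout of a header row of CPU columns followed by one row per softirq:

#!/usr/bin/env python3
# Rough sketch: sample per-CPU NET_RX softirq counts from /proc/softirqs
# and print the per-second delta for each CPU. Assumes the usual layout
# (header of CPU columns, then "NAME: count count ..." rows). Purely
# illustrative, not a substitute for mpstat.
import time

def net_rx_counts():
    with open("/proc/softirqs") as f:
        lines = f.read().splitlines()
    cpus = lines[0].split()
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0].rstrip(":") == "NET_RX":
            counts = [int(x) for x in fields[1:1 + len(cpus)]]
            return dict(zip(cpus, counts))
    return {}

prev = net_rx_counts()
while True:
    time.sleep(1)
    cur = net_rx_counts()
    deltas = {c: cur[c] - prev.get(c, 0) for c in cur}
    busy = {c: d for c, d in deltas.items() if d}
    print(busy or "idle")
    prev = cur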

<rps>
+== Suggested Configuration
+
+For a single queue device, a typical RPS configuration would be to set
+the rps_cpus to the CPUs in the same cache domain of the interrupting
+CPU. If NUMA locality is not an issue, this could also be all CPUs in
+the system. At high interrupt rate, it might be wise to exclude the
+interrupting CPU from the map since that already performs much work.
+
+For a multi-queue system, if RSS is configured so that a hardware
+receive queue is mapped to each CPU, then RPS is probably redundant
+and unnecessary. If there are fewer hardware queues than CPUs, then
+RPS might be beneficial if the rps_cpus for each queue are the ones that
+share the same cache domain as the interrupting CPU for that queue.

This isn't the first mention of "cache domain" (there is actually one above it, in the RSS Configuration section), but is the anticipated audience reasonably expected to already know what a cache domain is, particularly as it may relate to, or differ from, NUMA locality?

A very simplistic search for "cache domain" against Documentation/ doesn't find that term used anywhere else.
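
If "cache domain" means "the CPUs sharing the last-level cache of the interrupting CPU", then the sketch below is roughly how I'd picture building the rps_cpus mask. It assumes the LLC shows up as cache/index3 in sysfs (common on x86, not universal), and eth0, rx-0 and CPU 0 are placeholders of my own choosing:

#!/usr/bin/env python3
# Sketch: build an rps_cpus bitmap from the CPUs sharing the interrupting
# CPU's last-level cache. Assumes the LLC is cache/index3 in sysfs;
# "eth0", "rx-0" and IRQ_CPU = 0 are illustrative placeholders.
IRQ_CPU = 0
SHARED = f"/sys/devices/system/cpu/cpu{IRQ_CPU}/cache/index3/shared_cpu_list"

def expand(cpulist):
    """Expand a sysfs cpulist like '0-3,8' into a list of ints."""
    cpus = []
    for part in cpulist.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        elif part:
            cpus.append(int(part))
    return cpus

with open(SHARED) as f:
    cpus = expand(f.read())

mask = 0
for c in cpus:
    mask |= 1 << c

# Print the command rather than writing it, since that needs root.
print(f"echo {mask:x} > /sys/class/net/eth0/queues/rx-0/rps_cpus")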

<rfs>
+When the scheduler moves a thread to a new CPU while it has outstanding
+receive packets on the old CPU, packets may arrive out of order. To
+avoid this, RFS uses a second flow table to track outstanding packets
+for each flow: rps_dev_flow_table is a table specific to each hardware
+receive queue of each device. Each table value stores a CPU index and a
+counter. The CPU index represents the *current* CPU onto which packets
+for this flow are enqueued for further kernel processing. Ideally, kernel
+and userspace processing occur on the same CPU, and hence the CPU index
+in both tables is identical. This is likely false if the scheduler has
+recently migrated a userspace thread while the kernel still has packets
+enqueued for kernel processing on the old CPU.

This one is more drift than critique of the documentation itself, but just how often is the scheduler shuffling a thread of execution around anyway? I would have thought that was happening on a timescale that would seem positively glacial compared to packet arrival rates.
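
Out of that same curiosity, a quick hack to get a feel for the rate - assuming a kernel where /proc/<pid>/sched is available (CONFIG_SCHED_DEBUG) and exposes an se.nr_migrations counter; the pid is whatever thread one cares about:

#!/usr/bin/env python3
# Quick hack to watch how often the scheduler migrates a given thread.
# Assumes /proc/<pid>/sched exists and contains an "se.nr_migrations"
# line; the pid argument (default "self") is a placeholder.
import sys, time

pid = sys.argv[1] if len(sys.argv) > 1 else "self"

def migrations(pid):
    with open(f"/proc/{pid}/sched") as f:
        for line in f:
            if line.split(":")[0].strip() == "se.nr_migrations":
                return int(line.split(":")[1])
    return 0

prev = migrations(pid)
while True:
    time.sleep(1)
    cur = migrations(pid)
    print(f"{cur - prev} migrations/sec (total {cur})")
    prev = cur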

<accelerated rfs>
+== Suggested Configuration
+
+This technique should be enabled whenever one wants to use RFS and the
+NIC supports hardware acceleration.

Again, drifting from critique of the documentation itself, but if accelerated RFS is indeed goodness when RFS is being used and the NIC HW supports it, shouldn't it be enabled automagically? And, drifting back to the documentation, if accelerated RFS isn't enabled automagically with RFS today, does the reason suggest a caveat to the suggested configuration?
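
For completeness, here is a rough sketch of checking the plain-RFS knobs that acceleration builds on - the global socket flow table and the per-queue flow counts. The paths are the usual RFS sysctl and per-queue sysfs entries as I understand them, and "eth0" is a placeholder:

#!/usr/bin/env python3
# Sketch: check the plain-RFS prerequisites that accelerated RFS builds
# on. "eth0" is a placeholder; changing these values requires root.
import glob

with open("/proc/sys/net/core/rps_sock_flow_entries") as f:
    global_entries = int(f.read())
print("rps_sock_flow_entries:", global_entries)

for path in sorted(glob.glob("/sys/class/net/eth0/queues/rx-*/rps_flow_cnt")):
    with open(path) as f:
        print(path, "=", f.read().strip())

if global_entries == 0:
    print("RFS is effectively off; accelerated RFS has nothing to accelerate.")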

<xps>
+The queue chosen for transmitting a particular flow is saved in the
+corresponding socket structure for the flow (e.g. a TCP connection).
+This transmit queue is used for subsequent packets sent on the flow to
+prevent out of order (ooo) packets. The choice also amortizes the cost
+of calling get_xps_queues() over all packets in the connection. To avoid
+ooo packets, the queue for a flow can subsequently only be changed if
+skb->ooo_okay is set for a packet in the flow. This flag indicates that
+there are no outstanding packets in the flow, so the transmit queue can
+change without the risk of generating out of order packets. The
+transport layer is responsible for setting ooo_okay appropriately. TCP,
+for instance, sets the flag when all data for a connection has been
+acknowledged.

I'd probably go with "over all packets in the flow" as that part is in the "generic" discussion space rather than the specific example of a TCP connection.

And I'm curious/confused about thread migration rates vs packet rates - it seems like the mechanisms in place to avoid OOO packets have the property that the queue selected can remain "stuck" when the packet rates are sufficiently high. If being stuck isn't likely, that suggests "normal" processing is enough to get packets drained - that the thread of execution is (at least in the context of sending and receiving traffic) going idle. Is that then consistent with that thread of execution being bounced from CPU to CPU by the scheduler in the first place?
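
To make concrete what I mean by "stuck", here is a toy model - nothing to do with the actual kernel code paths, and pick_queue() is a made-up stand-in for get_xps_queues() - of a per-socket queue choice that is only allowed to change when ooo_okay is set:

# Toy model (not the kernel's code) of XPS-style sticky queue selection:
# the recorded queue only changes when a packet has ooo_okay set, i.e.
# when the flow has no outstanding packets. With steady traffic the flag
# never gets set, so the original choice persists across CPU moves.
def pick_queue(cpu):
    """Made-up stand-in for get_xps_queues(): map a CPU to a TX queue."""
    return cpu % 4

class Socket:
    def __init__(self):
        self.tx_queue = None

def select_queue(sk, cpu, ooo_okay):
    if sk.tx_queue is None or ooo_okay:
        sk.tx_queue = pick_queue(cpu)
    return sk.tx_queue

sk = Socket()
print(select_queue(sk, cpu=0, ooo_okay=False))  # first packet: queue 0
print(select_queue(sk, cpu=5, ooo_okay=False))  # thread moved, data in flight: still 0
print(select_queue(sk, cpu=5, ooo_okay=True))   # all data acked: now queue 1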

In the specific example of TCP, I see where ACK of data is sufficient to guarantee no OOO on outbound when migrating, but all that is really necessary is transmit completion by the NIC, no? Admittedly, getting that information to TCP is probably undesired overhead, but doesn't using the ACK "penalize" the thread/TCP talking to more remote (in terms of RTT) destinations?

rick jones