David Miller wrote:
> From: Gregory Haskins <ghaskins@xxxxxxxxxx>
> Date: Wed, 18 Mar 2009 23:48:46 -0400
>
>> To see this in action, try taking a moderately large smp system
>> (8-way+) and scaling the number of flows.
>>
>
> I can maintain line-rate over 10GB with a 64-cpu box.

Oh man, I am jealous of that 64-way :)

How many simultaneous flows? What hardware? What qdisc and other config do you use? MTU? I cannot replicate such results on 10GbE even with much smaller cpu counts.

On my test rig here, I have a 10GbE link connected by crossover between two 8-core boxes. Running one unidirectional TCP flow typically tops out at ~5.5Gb/s on 2.6.29-rc8. Granted, we are using MTU=1500, which in and of itself is part of the upper limit. However, that result in and of itself isn't a problem, per se. What is a problem is that the aggregate bandwidth drops as the number of flows scales. I would like to understand how to make this better, if possible, and perhaps I can learn something from your setup.

> It's not
> a problem.

To clarify terms, we are not saying "the stack performs inadequately". What we are saying here is that analysis of our workloads and of the current stack indicates that we are io-bound, and that this particular locking architecture in the qdisc subsystem is the apparent top gating factor keeping us from going faster. Therefore we are really asking "how can we make it even better?" This is not a bad question to ask in general, would you agree?

To vet our findings, we built the prototype I mentioned in the last mail, where we substituted the single queue and queue_lock with a per-cpu, lockless queue. This meant each cpu could submit work independently of the others, with substantially reduced contention. More importantly, it eliminated the property where the RFO pressure on the queue-lock's single cache line scales with the number of submitting cpus. When we did this, we got significant increases in aggregate throughput (somewhere on the order of 6%-25% depending on workload, but this was last summer so I am a little hazy on the exact numbers).

So you had said something to the effect of "Contention isn't implicitly a bad thing". I agree, to a point: at least so much as contention cannot always be avoided. Ultimately we only have one resource in this equation: the phy-link in question. So naturally multiple flows targeted for that link will contend for it. But the important thing to note here is that there are different kinds of contention, and contention against spinlocks *is* generally bad, for multiple reasons. It not only affects the cores under contention, but it affects all cores that exist in the same coherency domain. IMO it should be avoided whenever possible.

So I am not saying our per-cpu solution is the answer. But what I am saying is that we found that an architecture that doesn't piggyback all flows onto a single spinlock does have the potential to unleash even more Linux-networking fury :)

I haven't really been following the latest developments in netdev, but if I understand correctly, part of what we are talking about here would be addressed by the new MQ stuff? And if I also understand correctly, support for MQ is dependent on the hardware beneath it? If so, I wonder if we could apply some of the ideas I presented earlier for making "soft MQ" with a lockless queue per flow, or something like that?

Thoughts?

-Greg
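P.S. For anyone who wants a more concrete picture of the per-cpu lockless-queue idea above, here is a rough, purely illustrative userspace sketch. The names, the SPSC ring layout, and the drain helper are made up for this mail and are not the actual prototype code; it only shows the shape of the enqueue path I am describing:

/*
 * Illustrative sketch (hypothetical names, not the real prototype):
 * each producer cpu/thread owns its own single-producer/single-consumer
 * ring, so enqueue never takes a shared spinlock or bounces a shared
 * queue_lock cache line.  A single drain path (the analogue of the
 * qdisc_run/xmit path) pulls from the rings.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RING_SLOTS   256            /* power-of-two per-cpu ring size */
#define MAX_CPUS     8

struct pkt {                        /* stand-in for struct sk_buff */
    int payload;
};

/* head is written only by the owning (producer) cpu, tail only by the
 * drain path, so neither side needs a lock. */
struct percpu_ring {
    _Atomic size_t head;            /* next free slot (producer) */
    _Atomic size_t tail;            /* next slot to drain (consumer) */
    struct pkt *slots[RING_SLOTS];
} __attribute__((aligned(64)));     /* keep rings on separate cache lines */

static struct percpu_ring rings[MAX_CPUS];

/* Called on the submitting cpu: lock-free enqueue into that cpu's ring. */
static bool soft_mq_enqueue(int cpu, struct pkt *p)
{
    struct percpu_ring *r = &rings[cpu];
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;               /* ring full: caller drops or backs off */

    r->slots[head & (RING_SLOTS - 1)] = p;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Drain path: the one place that actually owns the phy-link.  It is the
 * only writer of tail, so it needs no lock against the producers either. */
static struct pkt *soft_mq_dequeue(int cpu)
{
    struct percpu_ring *r = &rings[cpu];
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail == head)
        return NULL;                /* this cpu's ring is empty */

    struct pkt *p = r->slots[tail & (RING_SLOTS - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return p;
}

int main(void)
{
    struct pkt a = { .payload = 1 };

    soft_mq_enqueue(0, &a);         /* e.g. from cpu 0's xmit path */
    struct pkt *p = soft_mq_dequeue(0);
    printf("drained payload %d\n", p ? p->payload : -1);
    return 0;
}

The only point of the sketch is that the enqueue side touches nothing shared between cpus, so adding flows/cpus adds no RFO traffic on a single queue_lock cache line; the single drain path still serializes access to the one phy-link, which is the contention we cannot avoid anyway.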