Ben Greear <greearb@xxxxxxxxxxxxxxx> writes:

> On 04/28/2020 12:39 PM, Dave Taht wrote:
>> On Tue, Apr 28, 2020 at 12:37 PM Toke Høiland-Jørgensen <toke@xxxxxxx> wrote:
>>>
>>> Ben Greear <greearb@xxxxxxxxxxxxxxx> writes:
>>>
>>>> On 04/28/2020 07:56 AM, Steve deRosier wrote:
>>>>> On Mon, Apr 27, 2020 at 7:54 AM <greearb@xxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> From: Ben Greear <greearb@xxxxxxxxxxxxxxx>
>>>>>>
>>>>>> While running tcp upload + download tests with ~200
>>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>>> vdevs, I noticed that __ieee80211_stop_queue was taking
>>>>>> around 20% of the CPU according to perf-top, with other
>>>>>> locking taking an additional ~15%.
>>>>>>
>>>>>> I believe the issue is that the ath10k driver would unlock
>>>>>> the txqueue when a single frame could be transmitted,
>>>>>> instead of waiting for a low water mark.
>>>>>>
>>>>>> So, this patch adds a low-water mark that is 1/4 of the
>>>>>> total tx buffers allowed.
>>>>>>
>>>>>> This appears to resolve the performance problem that I saw.
>>>>>>
>>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>> Hey Ben,
>>>>>
>>>>> Did you do any testing with this patch around latency? The
>>>>> nature of the thing that you fixed makes me wonder if it was
>>>>> intentional with respect to making WiFi fast, i.e. getting
>>>>> rid of buffers as much as possible. Obviously the CPU impact
>>>>> is likely to be an unintended consequence. In any case, I
>>>>> don't know anything for sure; it was just a thought that went
>>>>> through my head when reading this.
>>>>
>>>> I did not, but on average my patch should make the queues less
>>>> full, so I doubt it will hurt latency.
>>>
>>> I would tend to agree with that.
>>
>> Well, I don't, as it's dependent on right-sizing the ring in the
>> first place.
>
> My patch, barring strange issues elsewhere, can only make the
> firmware tx queues less full on average.
>
> If you want to test with different ring sizes, you can play with the
> tx_desc setting in the ath10k-ct driver 'fwcfg' options:
>
> http://www.candelatech.com/ath10k-10.4.php#config
>
> My testing shows that overall throughput goes down when using lots
> of peers if you have smaller numbers of tx buffers. This is because
> the firmware will typically spread its buffers over lots of peers
> and build smaller ampdu chains per transmit. An upper stack that fed
> frames to the firmware more intelligently could mitigate this, and
> it is not all bad anyway, since giving everyone 64-frame ampdu
> chains will increase burstiness at least somewhat.

Making each transmission shorter is arguably the right thing to do in
the "extremely congested" scenario, though. If you have to wait your
turn behind 100 other stations for your next TXOP, you'd generally
want each of those other stations to transmit for only (say) 1 ms
instead of their full 4 ms. Yes, this will hurt aggregate throughput
somewhat, but I'd argue that in most cases the overall application
performance would be better. You're right, though, that ideally this
would be managed a bit smarter than by just running out of buffers :)

> I've always envisioned that the stuff you and Toke and others have
> been working on would help in this area, but I don't understand your
> stuff well enough to know if that is true or not.

It might, although as per above I'm not quite sure what "helps" really
means in this context. What would you expect a good behaviour to be
here?
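Just so I'm sure I understand the mechanism: from the commit message I
read the change as being roughly of this shape (a sketch only, not
your actual diff; the 'queues_stopped' flag and the exact threshold
arithmetic are my guesses from the description):

/* Completion path, tx_lock held.  Stock behaviour wakes the mac80211
 * queues as soon as a single descriptor frees up, so a full firmware
 * queue flaps stopped/woken once per completed frame.  With a
 * low-water mark, the queues stay stopped until the pool has drained
 * to 1/4 of its size.
 */
static void htt_tx_dec_pending_sketch(struct ath10k_htt *htt)
{
	lockdep_assert_held(&htt->tx_lock);

	htt->num_pending_tx--;

	if (htt->queues_stopped &&
	    htt->num_pending_tx < htt->max_num_pending_tx / 4) {
		htt->queues_stopped = false;
		ieee80211_wake_queues(htt->ar->hw);
	}
}

If that's roughly right, the wake-up path simply runs far less often,
which would explain the CPU numbers you saw.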
I think what you're alluding to is to limit the total number of
stations that are allowed to have outstanding data in the firmware at
the same time, right? Here it would help a bit to know some more
details of how the firmware manages its internal queueing, and how it
schedules stations (if at all).

BTW, are you running any of these tests with AQL enabled?

-Toke
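P.S. In case it's useful as a reference point: AQL throttles on
estimated airtime in flight per station rather than on descriptor
count, which is why I'm curious whether it changes the picture with
many peers. A toy model of the idea (not the mac80211 code; the limit
value here is made up for illustration, and the real code keeps
per-AC low/high thresholds):

#include <stdbool.h>

/* Illustrative cap on the airtime a station may have queued below
 * the hardware queues.
 */
#define AQL_LIMIT_US 10000

struct sta_aql {
	unsigned long pending_airtime_us; /* airtime of frames in flight */
};

/* Gate checked before dequeueing another frame for this station. */
static bool aql_may_tx(const struct sta_aql *sta)
{
	return sta->pending_airtime_us < AQL_LIMIT_US;
}

/* Accounting on handing a frame to hardware / on tx completion. */
static void aql_account(struct sta_aql *sta, unsigned long airtime_us,
			bool completed)
{
	if (completed)
		sta->pending_airtime_us -= airtime_us;
	else
		sta->pending_airtime_us += airtime_us;
}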