Ben Greear <greearb@xxxxxxxxxxxxxxx> writes:

> On 04/28/2020 01:39 PM, Toke Høiland-Jørgensen wrote:
>> Ben Greear <greearb@xxxxxxxxxxxxxxx> writes:
>>
>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote:
>>>> greearb@xxxxxxxxxxxxxxx writes:
>>>>
>>>>> From: Ben Greear <greearb@xxxxxxxxxxxxxxx>
>>>>>
>>>>> While running tcp upload + download tests with ~200
>>>>> concurrent TCP streams, 1-2 processes, and 30 station
>>>>> vdevs, I noticed that __ieee80211_stop_queue was taking
>>>>> around 20% of the CPU according to perf-top, with other locking
>>>>> taking an additional ~15%.
>>>>>
>>>>> I believe the issue is that the ath10k driver would unlock the
>>>>> txqueue when a single frame could be transmitted, instead of
>>>>> waiting for a low water mark.
>>>>>
>>>>> So, this patch adds a low-water mark that is 1/4 of the total
>>>>> tx buffers allowed.
>>>>>
>>>>> This appears to resolve the performance problem that I saw.
>>>>>
>>>>> Tested with recent wave-1 ath10k-ct firmware.
>>>>>
>>>>> Signed-off-by: Ben Greear <greearb@xxxxxxxxxxxxxxx>
>>>>> ---
>>>>>  drivers/net/wireless/ath/ath10k/htt.h    | 1 +
>>>>>  drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++--
>>>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> index 31c4ddbf45cb..b5634781c0dc 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h
>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt {
>>>>>
>>>>>  	u8 target_version_major;
>>>>>  	u8 target_version_minor;
>>>>> +	bool needs_unlock;
>>>>>  	struct completion target_version_received;
>>>>>  	u8 max_num_amsdu;
>>>>>  	u8 max_num_ampdu;
>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> index 9b3c3b080e92..44795d9a7c0c 100644
>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c
>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt)
>>>>>  	lockdep_assert_held(&htt->tx_lock);
>>>>>
>>>>>  	htt->num_pending_tx--;
>>>>> -	if (htt->num_pending_tx == htt->max_num_pending_tx - 1)
>>>>> +	if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) {
>>>>
>>>> Why /4? Seems a bit arbitrary?
>>>
>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4
>>> full so that it is unlikely to run dry. Possibly it should restart
>>> sooner to keep it more full on average?
>>
>> Theoretically, the "keep the queue at the lowest possible level that
>> keeps it from underflowing" is what BQL is supposed to do. The diff
>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in
>> place of num_pending_tx. I've only compile tested it, and I'm a bit
>> skeptical that it will work right for this, but if anyone wants to give
>> it a shot, there it is.
>>
>> BTW, while doing that, I noticed there's a similar arbitrary limit in
>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going
>> to keep the arbitrary limit maybe use the same one? :)
>>
>>> Before my patch, the behaviour would be to try to keep it as full as
>>> possible, as in restart the queues as soon as a single slot opens up
>>> in the tx queue.
>>
>> Yeah, that seems somewhat misguided as well, from a latency perspective,
>> at least. But I guess that's what we're fixing with AQL. What does the
>> firmware do with the frames queued within?
>> Do they just go on a FIFO
>> queue altogether, or something smarter?
>
> Sort of like a mini-mac80211 stack inside the firmware is used to
> create ampdu/amsdu chains and schedule them with its own scheduler.
>
> For optimal throughput with 200 users streaming video,
> the ath10k driver should think that it has only a few active peers wanting
> to send data at a time (and so firmware would think the same), and the driver should
> be fed a large chunk of pkts for those peers. And then the next few peers.
> That should let firmware send large ampdu/amsdu to each peer, increasing throughput
> over all.

Yes, but also increasing latency because all other stations have to wait
for a longer TXOP (see my other reply).

> If you feed a few frames to each of the 200 peers, then even if
> firmware has 2000 tx buffers, that is only 10 frames per peer at best,
> leading to small ampdu/amsdu and thus worse over-all throughput and
> utilization of airtime.

Do you have any data on exactly how long (in time) each TXOP becomes in
these highly congested scenarios?

> It would be nice to be able to set certain traffic flows to have the
> throughput optimization and others to have the latency optimization.
> For instance, high latency on a streaming download is a good trade-off
> if it increases total throughput.

For the individual flows to a peer, fq_codel actually does a pretty good
job of putting the latency-sensitive flows first, which is why we want
the queueing to happen in mac80211 (where fq_codel is active) instead of
in the firmware.

> Maybe some of the AI folks training their AI to categorize cat
> pictures could instead start categorizing traffic flows and adjusting
> the stack realtime...

I know there are people trying to classify traffic with machine learning
(although usually for more nefarious purposes), but I'm not sure it's
feasible to do in a queue management algorithm (if at all). :)

-Toke
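
P.S. Since the actual dql diff isn't quoted above, here is a rough,
untested sketch of what driving the stop/wake decision from
dynamic_queue_limits.h could look like. Only struct dql and the dql_*()
calls are the real API from include/linux/dynamic_queue_limits.h; the
wrapper struct, function names and the choice of what to count are made
up for illustration and are not the actual patch:

/* Illustrative only: let a dql adapt the pending-tx limit at runtime,
 * the way BQL does for Ethernet drivers, instead of a fixed
 * num_pending_tx threshold. Assumed to be called under the same lock
 * that serializes num_pending_tx updates today (htt->tx_lock).
 */
#include <linux/dynamic_queue_limits.h>

struct example_htt_tx_q {
	struct dql dql;		/* stands in for num_pending_tx */
};

static void example_tx_q_init(struct example_htt_tx_q *q)
{
	/* One-second hold time, same as BQL uses for netdev queues */
	dql_init(&q->dql, HZ);
}

/* Frame handed to the firmware; returns true if the caller should
 * stop the ieee80211 queues.
 */
static bool example_tx_q_sent(struct example_htt_tx_q *q, unsigned int count)
{
	dql_queued(&q->dql, count);
	return dql_avail(&q->dql) < 0;
}

/* Tx completion from the firmware; returns true if the caller should
 * wake the ieee80211 queues again.
 */
static bool example_tx_q_completed(struct example_htt_tx_q *q,
				   unsigned int count)
{
	dql_completed(&q->dql, count);
	return dql_avail(&q->dql) >= 0;
}

Whether 'count' should be frames (like num_pending_tx) or bytes (like
BQL proper) is an open question, and the adaptive limit only behaves
sensibly if the firmware reports completions reasonably promptly, which
is exactly the part I'm a bit skeptical about here.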