I have been using Toke’s patch successfully with my Netgear R7800 (QCA9884) running OpenWRT snapshot build with ath10k-ct firmware and ath10k-ct-smallbuffers driver. I am using my R7800 as an AP only, so I am running a very stripped down build of OpenWRT. I provided several Flent tests to Dave Taht and Toke showing some pretty significant improvements in latency. I did not capture any CPU/sirq stats on my R7800 during these tests, though. It does appear my maximum download bandwidth has decreased somewhat, but I am more than happy to take that hit for the benefit of significantly lower latency. > On May 1, 2020, at 11:50 AM, John Deere <24601deerej@xxxxxxxxx> wrote: > > > > On 5/1/20 10:16 AM, Ben Greear wrote: >> On 04/30/2020 04:31 PM, John Deere wrote: >>> I've just tried Toke's DQL on an Archer C7 (QCA 988X) and it errors out with: >>> >>> ath10k_pci failed to increase tx pending count: -16, dropping >>> >>> This is with ath10k non-ct firmware and driver on OpenWrt (w/backports 5.4.27 & AQL). ath10k starts up but is unable to let any stations connect. >>> >>> I've also tried the standalone patch by Ben and it seems to have reduced the latencies on top of AQL by another 5 ms. >> Hello, >> Did you notice any throughput changes or system load changes in the test that you did with my patch? >> Thanks, >> Ben > > I have noticed that there has been a reduction in system load and memory use. Whereas previously with 11 clients on one Archer C7 acting as an AP only, my free memory available would be 51%, it now shows 55-56% - a 4% to 5% reduction. Note, these results were obtained alongside with reverting the following comit https://github.com/openwrt/openwrt/commit/1e27befe63ff4c69f110c7310316f4af75ee63e9. I believe that this same set of patches also are currently in use for the ath10k-ct smallbuffers variant on OpenWrt. > > I have not had the time to conduct any meaningful throughput measurements. > >>> >>> On 4/29/20 4:39 AM, Toke Høiland-Jørgensen wrote: >>>> Ben Greear <greearb@xxxxxxxxxxxxxxx> writes: >>>> >>>>> On 04/28/2020 12:37 PM, Toke Høiland-Jørgensen wrote: >>>>>> greearb@xxxxxxxxxxxxxxx writes: >>>>>> >>>>>>> From: Ben Greear <greearb@xxxxxxxxxxxxxxx> >>>>>>> >>>>>>> While running tcp upload + download tests with ~200 >>>>>>> concurrent TCP streams, 1-2 processes, and 30 station >>>>>>> vdevs, I noticed that the __ieee80211_stop_queue was taking >>>>>>> around 20% of the CPU according to perf-top, which other locking >>>>>>> taking an additional ~15%. >>>>>>> >>>>>>> I believe the issue is that the ath10k driver would unlock the >>>>>>> txqueue when a single frame could be transmitted, instead of >>>>>>> waiting for a low water mark. >>>>>>> >>>>>>> So, this patch adds a low-water mark that is 1/4 of the total >>>>>>> tx buffers allowed. >>>>>>> >>>>>>> This appears to resolve the performance problem that I saw. >>>>>>> >>>>>>> Tested with recent wave-1 ath10k-ct firmware. >>>>>>> >>>>>>> Signed-off-by: Ben Greear <greearb@xxxxxxxxxxxxxxx> >>>>>>> --- >>>>>>> drivers/net/wireless/ath/ath10k/htt.h | 1 + >>>>>>> drivers/net/wireless/ath/ath10k/htt_tx.c | 8 ++++++-- >>>>>>> 2 files changed, 7 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h >>>>>>> index 31c4ddbf45cb..b5634781c0dc 100644 >>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt.h >>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h >>>>>>> @@ -1941,6 +1941,7 @@ struct ath10k_htt { >>>>>>> >>>>>>> u8 target_version_major; >>>>>>> u8 target_version_minor; >>>>>>> + bool needs_unlock; >>>>>>> struct completion target_version_received; >>>>>>> u8 max_num_amsdu; >>>>>>> u8 max_num_ampdu; >>>>>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c >>>>>>> index 9b3c3b080e92..44795d9a7c0c 100644 >>>>>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c >>>>>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c >>>>>>> @@ -145,8 +145,10 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt) >>>>>>> lockdep_assert_held(&htt->tx_lock); >>>>>>> >>>>>>> htt->num_pending_tx--; >>>>>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1) >>>>>>> + if ((htt->num_pending_tx <= (htt->max_num_pending_tx / 4)) && htt->needs_unlock) { >>>>>> >>>>>> Why /4? Seems a bit arbitrary? >>>>> >>>>> Yes, arbitrary for sure. I figure restart filling the queue when 1/4 >>>>> full so that it is unlikely to run dry. Possibly it should restart >>>>> sooner to keep it more full on average? >>>> >>>> Theoretically, the "keep the queue at the lowest possible level that >>>> keeps it from underflowing" is what BQL is supposed to do. The diff >>>> below uses the dynamic adjustment bit (from dynamic_queue_limits.h) in >>>> place of num_pending_tx. I've only compile tested it, and I'm a bit >>>> skeptical that it will work right for this, but if anyone wants to give >>>> it a shot, there it is. >>>> >>>> BTW, while doing that, I noticed there's a similar arbitrary limit in >>>> ath10k_mac_tx_push_pending() at max_num_pending_tx/2. So if you're going >>>> to keep the arbitrary limit maybe use the same one? :) >>>> >>>>> Before my patch, the behaviour would be to try to keep it as full as >>>>> possible, as in restart the queues as soon as a single slot opens up >>>>> in the tx queue. >>>> >>>> Yeah, that seems somewhat misguided as well, from a latency perspective, >>>> at least. But I guess that's what we're fixing with AQL. What does the >>>> firmware do with the frames queued within? Do they just go on a FIFO >>>> queue altogether, or something smarter? >>>> >>>> -Toke >>>> >>>> >>>> >>>> >>>> diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c >>>> index f26cc6989dad..72771ff38a94 100644 >>>> --- a/drivers/net/wireless/ath/ath10k/core.c >>>> +++ b/drivers/net/wireless/ath/ath10k/core.c >>>> @@ -2497,6 +2497,10 @@ static int ath10k_core_init_firmware_features(struct ath10k *ar) >>>> return -EINVAL; >>>> } >>>> + dql_init(&ar->htt.dql, HZ); >>>> + ar->htt.dql.max_limit = ar->htt.max_num_pending_tx; >>>> + ar->htt.dql.min_limit = 8; >>>> + >>>> if (ar->hw_params.num_peers) >>>> ar->max_num_peers = ar->hw_params.num_peers; >>>> else >>>> diff --git a/drivers/net/wireless/ath/ath10k/htt.h b/drivers/net/wireless/ath/ath10k/htt.h >>>> index 4a12564fc30e..19024d063896 100644 >>>> --- a/drivers/net/wireless/ath/ath10k/htt.h >>>> +++ b/drivers/net/wireless/ath/ath10k/htt.h >>>> @@ -13,6 +13,7 @@ >>>> #include <linux/dmapool.h> >>>> #include <linux/hashtable.h> >>>> #include <linux/kfifo.h> >>>> +#include <linux/dynamic_queue_limits.h> >>>> #include <net/mac80211.h> >>>> #include "htc.h" >>>> @@ -1965,8 +1966,8 @@ struct ath10k_htt { >>>> /* Protects access to pending_tx, num_pending_tx */ >>>> spinlock_t tx_lock; >>>> int max_num_pending_tx; >>>> - int num_pending_tx; >>>> int num_pending_mgmt_tx; >>>> + struct dql dql; >>>> struct idr pending_tx; >>>> wait_queue_head_t empty_tx_wq; >>>> diff --git a/drivers/net/wireless/ath/ath10k/htt_tx.c b/drivers/net/wireless/ath/ath10k/htt_tx.c >>>> index e9d12ea708b6..911a79470bdf 100644 >>>> --- a/drivers/net/wireless/ath/ath10k/htt_tx.c >>>> +++ b/drivers/net/wireless/ath/ath10k/htt_tx.c >>>> @@ -144,8 +144,8 @@ void ath10k_htt_tx_dec_pending(struct ath10k_htt *htt) >>>> { >>>> lockdep_assert_held(&htt->tx_lock); >>>> - htt->num_pending_tx--; >>>> - if (htt->num_pending_tx == htt->max_num_pending_tx - 1) >>>> + dql_completed(&htt->dql, 1); >>>> + if (dql_avail(&htt->dql) > 0) >>>> ath10k_mac_tx_unlock(htt->ar, ATH10K_TX_PAUSE_Q_FULL); >>>> } >>>> @@ -153,11 +153,11 @@ int ath10k_htt_tx_inc_pending(struct ath10k_htt *htt) >>>> { >>>> lockdep_assert_held(&htt->tx_lock); >>>> - if (htt->num_pending_tx >= htt->max_num_pending_tx) >>>> + if (dql_avail(&htt->dql) <= 0) >>>> return -EBUSY; >>>> - htt->num_pending_tx++; >>>> - if (htt->num_pending_tx == htt->max_num_pending_tx) >>>> + dql_queued(&htt->dql, 1); >>>> + if (dql_avail(&htt->dql) <= 0) >>>> ath10k_mac_tx_lock(htt->ar, ATH10K_TX_PAUSE_Q_FULL); >>>> return 0; >>>> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c >>>> index 2d03b8dd3b8c..1fe251742b0a 100644 >>>> --- a/drivers/net/wireless/ath/ath10k/mac.c >>>> +++ b/drivers/net/wireless/ath/ath10k/mac.c >>>> @@ -3998,7 +3998,7 @@ static bool ath10k_mac_tx_can_push(struct ieee80211_hw *hw, >>>> if (ar->htt.tx_q_state.mode == HTT_TX_MODE_SWITCH_PUSH) >>>> return true; >>>> - if (ar->htt.num_pending_tx < ar->htt.tx_q_state.num_push_allowed) >>>> + if (dql_avail(&ar->htt.dql) < ar->htt.tx_q_state.num_push_allowed) >>>> return true; >>>> if (artxq->num_fw_queued < artxq->num_push_allowed) >>>> @@ -4159,7 +4159,7 @@ void ath10k_mac_tx_push_pending(struct ath10k *ar) >>>> if (ar->htt.tx_q_state.mode != HTT_TX_MODE_SWITCH_PUSH) >>>> return; >>>> - if (ar->htt.num_pending_tx >= (ar->htt.max_num_pending_tx / 2)) >>>> + if (dql_avail(&ar->htt.dql) < (ar->htt.dql.limit / 2)) >>>> return; >>>> rcu_read_lock(); >>>> @@ -7160,7 +7160,7 @@ void ath10k_mac_wait_tx_complete(struct ath10k *ar) >>>> bool empty; >>>> spin_lock_bh(&ar->htt.tx_lock); >>>> - empty = (ar->htt.num_pending_tx == 0); >>>> + empty = (ar->htt.dql.num_completed == ar->htt.dql.num_queued); >>>> spin_unlock_bh(&ar->htt.tx_lock); >>>> skip = (ar->state == ATH10K_STATE_WEDGED) || >>>> diff --git a/drivers/net/wireless/ath/ath10k/txrx.c b/drivers/net/wireless/ath/ath10k/txrx.c >>>> index 39abf8b12903..fe7cd53c2bf9 100644 >>>> --- a/drivers/net/wireless/ath/ath10k/txrx.c >>>> +++ b/drivers/net/wireless/ath/ath10k/txrx.c >>>> @@ -80,7 +80,7 @@ int ath10k_txrx_tx_unref(struct ath10k_htt *htt, >>>> ath10k_htt_tx_free_msdu_id(htt, tx_done->msdu_id); >>>> ath10k_htt_tx_dec_pending(htt); >>>> - if (htt->num_pending_tx == 0) >>>> + if (htt->dql.num_completed == htt->dql.num_queued) >>>> wake_up(&htt->empty_tx_wq); >>>> spin_unlock_bh(&htt->tx_lock); >>>> >>>