Re: Regarding .wake_tx_queue() model

Toke Høiland-Jørgensen <toke@xxxxxxxxxx> · Tue, 05 May 2020 18:50:45 +0200

Maxime Bizon <mbizon@xxxxxxxxxx> writes:

> On Tuesday 05 May 2020 at 15:53:08 (+0200), Toke Høiland-Jørgensen wrote:
>
>> Well, I think that should be fine? Having a longer HW queue is fine, as
>> long as you have some other logic to not fill it all the time (like the
>> "max two aggregates" logic I mentioned before). I think this is what
>> ath9k does too. At least it looks like both drv_tx() and
>> release_buffered_frames() will just abort (and drop in the former case)
>> if there is no HW buffer space left.
>
> Ok
>
> BTW, the "max two aggregates" rule, why is it based on number of
> frames and not duration? If you are queuing 1500 bytes @1Mbit/s, even
> one frame is enough, but not so for faster rates.

It's the minimum amount that works: assuming you get a TX completion
when the first one is done, the CPU has time to build the next one
before the second one is done, and you avoid starvation. Note that this
only works well for aggregates, since their size tends to vary with
rate; if you're queueing individual packets to the HWQ you need
something that takes rate into account, which is what AQL (Airtime
Queue Limits) does for ath10k.
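
To make the difference concrete, here is a rough userspace sketch of
limiting the HW queue by airtime instead of by frame count. This is not
kernel code and not how AQL is actually implemented; struct hwq_budget
and all the helper names are made up for illustration, and the airtime
estimate only counts payload bits (no preamble, IFS or ACK overhead):

```c
#include <stdbool.h>
#include <stdint.h>

/* Payload airtime in microseconds for len bytes at rate_kbps.
 * Simplification: ignores preamble, IFS and ACK overhead. */
static uint64_t airtime_us(uint32_t len, uint32_t rate_kbps)
{
	return ((uint64_t)len * 8 * 1000) / rate_kbps;
}

/* Hypothetical limiter state: airtime currently queued to the HW. */
struct hwq_budget {
	uint64_t pending_us;
	uint64_t limit_us;
};

/* Returns true (and accounts for the frame) if it may be queued.
 * One frame is always allowed when the queue is empty, so a slow
 * frame that exceeds the limit on its own cannot stall the queue. */
static bool hwq_may_queue(struct hwq_budget *b, uint32_t len,
			  uint32_t rate_kbps)
{
	uint64_t t = airtime_us(len, rate_kbps);

	if (b->pending_us != 0 && b->pending_us + t > b->limit_us)
		return false;

	b->pending_us += t;
	return true;
}

/* Called on TX completion to give the airtime back. */
static void hwq_tx_done(struct hwq_budget *b, uint32_t len,
			uint32_t rate_kbps)
{
	uint64_t t = airtime_us(len, rate_kbps);

	b->pending_us -= (t < b->pending_us) ? t : b->pending_us;
}
```

With this, a single 1500-byte frame at 1 Mbit/s accounts for 12000 us
of airtime, while the same frame at 300 Mbit/s accounts for 40 us, so a
fixed frame count is either far too much or far too little.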

> It would be even better if minstrel could limit the total duration
> when computing number of hardware retries, and then mac80211 could
> handle software retries for those really slow packets, no hardware
> FIFO "pollution"

Minstrel will compute the max aggregate size based on the rate, which is
why the "two aggregates" scheme works. It likely could be smarter about
limiting the number of retries, as you say, but we never did get around
to doing anything about that :)
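
Roughly, the idea is the following (a simplified userspace sketch, not
the actual Minstrel code; the 4 ms target and the names max_aggr_len()
and MAX_AGGR_US are assumptions for illustration):

```c
#include <stdint.h>

#define MAX_AGGR_US   4000u   /* assumed airtime target per aggregate */
#define MAX_AMPDU_LEN 65535u  /* 802.11n A-MPDU length cap in bytes */

/* Largest aggregate (in bytes) that fits in MAX_AGGR_US at rate_kbps:
 * bytes = rate_kbps * 1000 bit/s * us * 1e-6 / 8 = rate_kbps * us / 8000 */
static uint32_t max_aggr_len(uint32_t rate_kbps)
{
	uint64_t len = ((uint64_t)rate_kbps * MAX_AGGR_US) / 8000;

	return len > MAX_AMPDU_LEN ? MAX_AMPDU_LEN : (uint32_t)len;
}
```

Since the byte limit scales with the rate, two aggregates in the HW
queue are bounded by about two times the target duration, whatever the
rate is, until the A-MPDU length cap kicks in at the high rates.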

>> > Also .release_buffered_frames() codepath is difficult to test, how do
>> > you trigger that reliably? Assuming VIF is an AP, then you need the
>> > remote STA to go to sleep even though you have traffic waiting for it
>> > in the txqi. For now I patch the stack to introduce artificial delay.
>> >
>> > Since my hardware has a sticky per-STA PS filter with good status
>> > reporting, I'm considering using ieee80211_sta_block_awake() and only
>> > unblock STA when all its txqi are empty to get rid of
>> > .release_buffered_frames() complexity.
>> 
>> I'm probably not the right person to answer that; never did have a good
>> grip on the details of PS support.
>
> Hopefully Felix or Johannes will see this.
>
> Just wondering if there is anything I'm missing, this alternative way
> of doing it looks easier to me because it removes "knowledge" of frame
> delivery under service period from the driver:
>
>
> 1) first get rid of buffered txq traffic when entering PS:
>
> --- a/net/mac80211/rx.c
> +++ b/net/mac80211/rx.c
> @@ -1593,6 +1593,15 @@ static void sta_ps_start(struct sta_info *sta)
>                         list_del_init(&txqi->schedule_order);
>                 spin_unlock(&local->active_txq_lock[txq->ac]);
>  
> -               if (txq_has_queue(txq))
> -                       set_bit(tid, &sta->txq_buffered_tids);
> -               else
> -                       clear_bit(tid, &sta->txq_buffered_tids);
> +               /* transfer txq into tx_filtered frames */
> +               spin_lock_bh(&fq->lock);
> +               while ((skb = skb_dequeue(&txq->frags)))
> +                       skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
> +               /* use something more efficient like fq_tin_reset  */
> +               while ((skb = fq_tin_dequeue(fq, tin, fq_tin_dequeue_func)))
> +                       skb_queue_tail(&sta->tx_filtered[txq->ac], skb);
> +               spin_unlock_bh(&fq->lock);

This seems like a bad idea; we want the TXQ mechanism to decide which
frame to send on wakeup.

> 2) driver registers for STA_NOTIFY_SLEEP
>
> 3) driver counts tx frames pending in the hardware per STA and calls
> ieee80211_sta_block_awake(sta, 1) when > 0
>
> 4) on tx completion, if STA is sleeping and the number of pending tx
> frames in hardware for a given STA reaches 0:
>  - if driver buffers other frames for this STA, release them with
>    TX_FILTERED in reverse order
>  - call ieee80211_sta_block_awake(sta, 0)
>
> What do you think?

As I said, I'm not an expert on the PS code, so I may be missing
something. But it seems to me that in a model where the driver pulls the
frames from mac80211 (i.e., for drivers using wake_tx_queue), there
really is no way around having a way to instruct the driver "please use
these flags for the next N frames you send" - which is what
release_buffered_frames() does. What you're suggesting is basically
turning off this 'pull mode' for the frames buffered during PS and
having mac80211 revert to push mode for those, right? But then you lose
the benefits of pull mode (the TXQs) for those frames.
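
As a toy illustration of what I mean by "use these flags for the next N
frames" (userspace C, nothing to do with the real mac80211 structures;
toy_txq, toy_release_frames() and the TOY_FLAG_* values are all
invented for this sketch):

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TXQ_LEN        8
#define TOY_FLAG_MORE_DATA 0x1u  /* more frames remain buffered */
#define TOY_FLAG_EOSP      0x2u  /* last frame of this service period */

struct toy_frame {
	int id;
	uint32_t flags;
};

/* Toy ring buffer standing in for a per-STA TXQ. */
struct toy_txq {
	struct toy_frame frames[TOY_TXQ_LEN];
	size_t head, count;
};

/* The "driver" pulls up to n frames from the queue and tags them.
 * The queue still decides which frames go out; the caller only says
 * how many and the flags follow from that. Returns frames released. */
static size_t toy_release_frames(struct toy_txq *q, size_t n,
				 struct toy_frame *out)
{
	size_t i, released = n < q->count ? n : q->count;

	for (i = 0; i < released; i++) {
		out[i] = q->frames[q->head];
		q->head = (q->head + 1) % TOY_TXQ_LEN;
		q->count--;

		out[i].flags = 0;
		if (i + 1 == released)
			out[i].flags |= TOY_FLAG_EOSP;
		if (q->count > 0)
			out[i].flags |= TOY_FLAG_MORE_DATA;
	}

	return released;
}
```

The point is that the frames stay in the TXQ until the moment they are
released, so the queueing logic still gets to pick them.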

I remember Johannes talking about a 'shim layer' between the mac80211
TXQs and the 'drv_tx()' hook as a way to bring the benefits of the TXQs
to the 'long tail' of simple drivers that don't do any internal
buffering anyway, without having to change the drivers to use 'pull
mode'. Am I wrong in thinking that mwl8k may be a good candidate for
such a layer? From glancing through the existing driver it looks like
it's mostly just taking each frame, wrapping it in a HW descriptor, and
sticking it on a TX ring?

>> What hardware is it you're writing a driver for, BTW, and are you
>> planning to upstream it? :)
>
> that's a rewrite of the mwl8k driver targeting the same hardware, but
> with a different firmware interface.
>
> if I can bring it on par with the existing set of supported hardware
> and features, I could try to upstream it yes.

That would be awesome! :)

-Toke