On 11/26/18 3:38 AM, Stanislaw Gruszka wrote:
> On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
>> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>>
>>> I believe I have the answer as to why we're getting frames after we
>>> stop the queue. I had a little chat about this in #kernelnewbies and
>>> some other devs believe it is intentional.
>>>
>>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>>> releasing local->queue_stop_reason_lock and calling the driver's tx
>>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in
>>> between these two points the thread can be preempted. So while we
>>> stop the queue in one thread, there can be 20 other threads in
>>> between that have already checked that the queue was running and
>>> have a frame buffer sitting on their stacks.
>>
>> Not 20, only 1 per netdev queue. I suspect that means just 1 per
>> hardware queue, but I don't know how rt2x00 maps netdev queues to
>> hardware queues. If you have lots of netdevs, that might actually be
>> 20, but I suspect that's not what's going on here.
>>
>>> I think it could be fixed with the below patch, but I'm being told
>>> that it isn't needed and that the driver should just *quietly* drop
>>> the frame:
>>
>> [snip patch]
>>
>>> Could anybody illuminate for me why this isn't done? I know that
>>> there has to be a point where we realize there are too many items in
>>> the queue and we can't keep up, but this would seem to be a terrible
>>> way to do that. Is it one of those things where it isn't worth the
>>> performance degradation? Any insights would be most appreciated!! :)
>>
>> There's just not much point, and doing the locking here would mean
>> it's across _all_ queues, whereas if you have multiple hardware
>> queues you wouldn't really need it.
>>
>> Note that with internal TXQs with fq/codel support etc.
>> we actually have the fq->lock and it's global, and it turns out
>> that's still a better deal than actually doing parallel TX. So this
>> may not matter so much.
>>
>> In any case, while this might solve the problem for the specific case
>> you're thinking about, as soon as you have something else starting or
>> stopping queues from outside the TX path it still wouldn't actually
>> help.
>>
>> By the way, one thing that can likely exacerbate the problem is
>> fragmentation; once you have that, you're more likely to trigger lots
>> of frames in close succession.
>>
>> What I would suggest doing in the driver is actually stopping the
>> queues once a *threshold* is reached, rather than when they are full.
>> Say if you have 128 entries in the HW queue, you can stop once you
>> reach 120 (you probably don't really need the other 8 places). If you
>> get a 121st frame, then you can still put it on the queue (until you
>> have filled all 128), but you can also stop the queue again (stopping
>> queues is idempotent).
>
> That's what rt2x00 does, but we have only 64 entries on the tx queues,
> and because of that the threshold is small. In:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> I proposed increasing the queue size to 256 and hence making the
> threshold bigger. However, I was concerned about bufferbloat and
> requested testing from users to see how this affects latency. I never
> got any testing results :-(
>
> Thanks
> Stanislaw

Hello Stanislaw,

I almost managed to get that patch into a build to send to somebody who
can reproduce the error in abundance, but they have to have 15 different
people hammer the router to do it, and we ended up sending them a few
other experimental builds instead.

I'm still learning this driver, but I don't see where it creates a
struct net_device -- was that something that came out after the driver
was originally written? (And maybe it gets implicitly created somewhere
else?)
As I understand it, the best way to do this is by setting tx_queue_len
while the interface is down (via "ip link") and then allocating the
queue when it's brought up. Unless you know of a problem with this
approach, I'm planning on making a patch just for that. It would then
be easier to tune for an end user's particular application. For
instance, if there is a small number of users who each consume a very
large amount of bandwidth, bufferbloat could become a problem if the
queue is too long. But with a larger number of users, I don't think
bufferbloat will matter as much as the performance lost from dropping
frames and having to retransmit many lost packets at the TCP layer.
That would also consume more uplink bandwidth.

BTW, I guess we need that struct net_device object in order to
increment tx_dropped properly as well.

Thanks,
Daniel
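The tuning flow proposed above would look roughly like this. "wlan0" is
a placeholder interface name, and the flow assumes a patched driver
that sizes its queues from tx_queue_len when the interface comes up --
stock rt2x00 does not do this today; that is exactly the patch being
proposed:

```shell
# Hypothetical flow; requires the proposed patch, since the stock
# driver sizes its tx queues at initialization, not from tx_queue_len.
ip link set dev wlan0 down
ip link set dev wlan0 txqueuelen 256   # larger queue => larger stop threshold
ip link set dev wlan0 up

# Watch the TX "dropped" counter to judge the bufferbloat-vs-drops tradeoff:
ip -s link show dev wlan0
```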