On 11/26/18 3:38 AM, Stanislaw Gruszka wrote:
> On Fri, Nov 23, 2018 at 08:45:54PM +0100, Johannes Berg wrote:
>> On Tue, 2018-11-20 at 15:20 -0600, Daniel Santos wrote:
>>
>>> I believe I have the answer as to why we're getting frames after we
>>> stop the queue. I had a little chat about this in #kernelnewbies and
>>> some other devs believe it is intentional.
>>>
>>> There is a race in ieee80211_tx_frags (net/mac80211/tx.c) between
>>> releasing local->queue_stop_reason_lock and calling the driver's tx
>>> until we lock queue->tx_lock in rt2x00queue_write_tx_frame -- in
>>> between these two points the thread can be preempted. So while we
>>> stop the queue in one thread, there can be 20 other threads in
>>> between that have already checked that the queue was running and
>>> have a frame buffer sitting on their stacks.
>>
>> Not 20, only 1 per netdev queue. I suspect that means just 1 per
>> hardware queue, but I don't know how rt2x00 maps netdev queues to
>> hardware queues. If you have lots of netdevs, that might actually be
>> 20, but I suspect that's not what's going on here.
>>
>>> I think it could be fixed with the below patch, but I'm being told
>>> that it isn't needed and that the driver should just *quietly* drop
>>> the frame:
>>
>> [snip patch]
>>
>>> Could anybody illuminate for me why this isn't done? I know that
>>> there has to be a point where we realize there are too many items in
>>> the queue and we can't keep up, but this would seem to be a terrible
>>> way to do that. Is it one of those things where it isn't worth the
>>> performance degradation? Any insights would be most appreciated!! :)
>>
>> There's just not much point, and doing the locking here would mean
>> it's across _all_ queues, whereas if you have multiple hardware
>> queues you wouldn't really need it.
>>
>> Note that with internal TXQs with fq/codel support etc.
>> we actually have the fq->lock and it's global, and it turns out
>> that's still a better deal than actually doing parallel TX. So this
>> may not matter so much.
>>
>> In any case, while this might solve the problem for the specific case
>> you're thinking about, as soon as you have something else starting or
>> stopping queues from outside the TX path it still wouldn't actually
>> help.
>>
>> By the way, one thing that can likely exacerbate the problem is
>> fragmentation; once you have that, you're more likely to trigger lots
>> of frames in close succession.
>>
>> What I would suggest doing in the driver is actually stopping the
>> queues once a *threshold* is reached, rather than when they are full.
>> Say if you have 128 entries in the HW queue, you can stop once you
>> reach 120 (you probably don't really need the other 8 places). If you
>> get a 121st frame, then you can still put it on the queue (until you
>> have filled all 128), but you can also stop the queue again (stopping
>> queues is idempotent).
>
> That's what rt2x00 does, but we have only 64 entries on the tx queues,
> and because of that the threshold is small. In:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=82751
>
> I proposed increasing the queue size to 256 and hence making the
> threshold bigger. However, I was concerned about bufferbloat and
> requested testing from users to see how this affects latency. I never
> got any testing results :-(
>
> Thanks
> Stanislaw

Hello Stanislaw,

I almost managed to get that patch into a build to send to somebody who
can reproduce the error in abundance, but they have to have 15 different
people hammer the router to do it, and we ended up sending them a few
other experimental builds instead.

I'm still learning this driver, but I don't see where it creates a
struct net_device -- was that something that came out after the driver
was originally written? (And maybe it gets implicitly created somewhere
else?)
As I understand it, the best way to do this is by setting tx_queue_len
while the interface is down (via "ip link") and then allocating the
queue when it's brought up. Unless you know of a problem with this
approach, I'm planning on making a patch just for that. It would then
be easier to tune for an end user's particular application. For
instance, if there is a small number of users who each consume a very
large amount of bandwidth, bufferbloat could become a problem if the
queue is too long. But with a larger number of users, I don't think
bufferbloat will matter as much as the performance lost from dropping
frames and having to retransmit many lost packets at the TCP layer.
That would also consume more uplink bandwidth.

BTW, I guess we need that struct net_device object in order to
increment tx_dropped properly as well.

Thanks,
Daniel
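The tuning flow proposed above would look roughly like this. "wlan0" is
a placeholder interface name, and the flow assumes a patched driver
that sizes its queues from tx_queue_len when the interface comes up --
stock rt2x00 does not do this today; that is exactly the patch being
proposed:

```shell
# Hypothetical flow; requires the proposed patch, since the stock
# driver sizes its tx queues at initialization, not from tx_queue_len.
ip link set dev wlan0 down
ip link set dev wlan0 txqueuelen 256   # larger queue => larger stop threshold
ip link set dev wlan0 up

# Watch the TX "dropped" counter to judge the bufferbloat-vs-drops tradeoff:
ip -s link show dev wlan0
```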