Zefir Kurtisi <zefir.kurtisi@xxxxxxxxxxxx> writes:

> On 01.12.20 14:33, Toke Høiland-Jørgensen wrote:
>> Zefir Kurtisi <zefir.kurtisi@xxxxxxxxxxxx> writes:
>>
>>> CC += adrian
>>>
>>> On 24.11.20 15:45, Toke Høiland-Jørgensen wrote:
>>>> Zefir Kurtisi <zefku@xxxxxxxxxxxx> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am running into a strange issue with the ath9k operating a 9590
>>>>> device which to me seems like a HW issue, but since work on rate
>>>>> controllers is already going for decades, I hardly can imagine this
>>>>> never showed up.
>>>>>
>>>>> The issue observed is this: the TX status descriptors never report
>>>>> rateindex 1, it is always 0, 2, or 3, but never 1.
>>>>>
>>>>> I noticed this by overwriting the rate configuration provided by
>>>>> minstrel to a static setup, e.g. (7,3)(5,3)(3,3)(1,3), all MCS. The
>>>>> device operates as iperf client to a connected AP and continuously
>>>>> transmits data. While at that, the attenuation between the endpoints
>>>>> is gradually increased, expecting to see a gradual shift in the
>>>>> reported TX status rateindex from 0 to 3. But nada, the values
>>>>> reported are 0,2, and 3 - never 1.
>>>>>
>>>>> I double checked that the TX descriptors are correctly set with the
>>>>> rates and retry counts - all looking sane.
>>>>>
>>>>> More obvious, after changing the rate configuration to
>>>>> (7,3)(1,3)(5,3)(3,3) the expectation would be to have either 0 or 1
>>>>> reported as rateidx, since the transmission ought to be successful
>>>>> with the lowest rate or never. Again all rates are reported but 1.
>>>>>
>>>>> Now the question for me is: what is the HW exactly doing with such a
>>>>> configuration? Is it skipping the second rate, or is it just reporting
>>>>> wrong?
>>>>
>>>> You should be able to see this by looking at the rates the frames are
>>>> being sent at, shouldn't you?
>>>>
>>> Yes, did that and from there it points to that the second rate is just skipped.
>>>
>>> Here are some use cases and their sniffing results. Setup is a 11ng STA connected
>>> to AP with the attenuation adjusted such that MCS 7 fails, while MCS 5 and below
>>> succeed. Monitor is sniffing while sending a single ping from AP to STA.
>>>
>>> With a rate configuration of (7/2)(3/2)(1/2) we get:
>>> 14:02:42.923880 9481489761us tsft 2412 MHz 11n -68dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: e Pad 20 KeyID 0
>>> 14:02:42.923909 9481490037us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV: e Pad 20 KeyID 0
>>> 14:02:42.925244 9481491044us tsft 2412 MHz 11n -68dBm signal 13.0 Mb/s MCS 1 20
>>> MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: e Pad 20 KeyID 0
>>>
>>>
>>> with (7/2)(1/2)(3/2):
>>> 13:59:37.073147 9295637087us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV: c Pad 20 KeyID 0
>>> 13:59:37.073467 9295637438us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV: c Pad 20 KeyID 0
>>> 13:59:37.074591 9295638498us tsft 2412 MHz 11n -68dBm signal 26.0 Mb/s MCS 3 20
>>> MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: c Pad 20 KeyID 0
>>>
>>> and with (7/2)(3/2):
>>> 14:04:27.269806 9585836783us tsft 2412 MHz 11n -69dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -69dBm signal antenna 0 Data IV: 10 Pad 20 KeyID 0
>>> 14:04:27.270342 9585837344us tsft 2412 MHz 11n -68dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: 10 Pad 20 KeyID 0
>>> 14:04:27.271368 9585838370us tsft 2412 MHz 11n -68dBm signal 65.0 Mb/s MCS 7 20
>>> MHz long GI RX-STBC0 -68dBm signal antenna 0 Data IV: 10 Pad 20 KeyID 0
>>> [..]
>>>
>>> a total of 14 attempts at MCS 7 with the ping finally failing.
>>>
>>>>> Both possibilities have great impact, since upper layers (like
>>>>> airtime) use the returned rateidx to calculate and configure operating
>>>>> parameters at runtime.
>>>>
>>>> Have you actually observed any issues from this? If it's just skipping a
>>>> rate, minstrel should still be able to make decisions based on the
>>>> actual values returned, no?
>>>>
>>> The issues arise from the fact that the driver reports a
>>> (tx-rateindex/tx-attempt-index) per TX descriptor, leaving the driver to calculate
>>> what was put on air based on these two values. If one had rates set to
>>> (7/2)(3/7)(1/2) and the TX status reports (tx-rateindex=2/tx-attempt-index=0),
>>> driver assumes there were 10 attempts in total while in fact they were 3 when the
>>> second rate is skipped. What direct effect this has on RC I can't grasp, but it
>>> definitely falsifies statistics.
>>>
>>> Same goes for airtime: check how this falsifies its calculation in
>>> ath_tx_count_airtime().
>>
>> Ah, right, I was assuming that rates[1].count would be reset to zero
>> somehow. Have you confirmed that the attempts actually go up in the
>> Minstrel stats for the skipped rate?
>>
>>> Also, the above mentioned is an immediate visible issue: if RC
>>> provides two rates e.g. (7/3)(5/3) of which the first is too high and
>>> the second is not even attempted, frames don't make it through.
>>
>> Yeah, rate control would likely take longer to converge to the right
>> rate. I suppose if this is a hardware model-specific issue that a quirks
>> bit could be added to instruct Minstrel to disregard the second index.
>> But it does sound a bit odd; have you verified that it's consistent on
>> different units of the same model (and not just a busted device)?
>>
>
> False alarm.
>
> We got confirmation that the observed failure with that exact chip
> revision is not happening on a different platform.
> It still might be a HW issue specific to our rarely used PPC platform,
> but it is not an ath9k malfunction. I'll dig further into that and
> report back if it is relevant for the list.
>
> Thanks Toke for the feedback and insights and sorry for noise.

You're welcome, and great to hear that you got closer to a resolution :)

-Toke
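
For reference, the attempt accounting discussed earlier in the thread can be sketched as below. This is only an illustration of the arithmetic, not ath9k code: `struct rate_entry`, `assumed_attempts()` and `on_air_attempts()` are hypothetical names, and the sketch assumes the attempt index in the TX status is zero-based, so the final stage contributes `attempt_index + 1` attempts.

```c
#include <assert.h>

/* Hypothetical stand-in for one stage of a multi-rate retry chain. */
struct rate_entry {
    int mcs;    /* MCS index configured for this stage */
    int count;  /* configured number of attempts for this stage */
};

/*
 * Attempts implied by the TX status when every stage before `rateindex`
 * is assumed to have been fully used on air.
 */
static int assumed_attempts(const struct rate_entry *chain,
                            int rateindex, int attempt_index)
{
    int total = 0;

    assert(rateindex >= 0 && attempt_index >= 0);
    for (int i = 0; i < rateindex; i++)
        total += chain[i].count;
    return total + attempt_index + 1;
}

/*
 * Same calculation, but stage `skipped` is treated as never transmitted
 * (pass -1 if no stage is skipped).
 */
static int on_air_attempts(const struct rate_entry *chain,
                           int rateindex, int attempt_index, int skipped)
{
    int total = 0;

    assert(rateindex >= 0 && attempt_index >= 0);
    for (int i = 0; i < rateindex; i++)
        if (i != skipped)
            total += chain[i].count;
    return total + attempt_index + 1;
}
```

With the chain (7/2)(3/7)(1/2) and a status of rateindex 2, attempt index 0, `assumed_attempts()` gives 2 + 7 + 1 = 10, while `on_air_attempts()` with stage 1 skipped gives 2 + 1 = 3, matching the numbers quoted in the thread.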