I've verified that turning off hardware encryption on the AP and the STA is indeed preventing the issue. As soon as one of them is using the hardware encryption I can trigger the problem. (In my setup it seems to be mostly caused by the AP, since I needed sometimes as much as three rekeys to get the freeze when the AP was using Software and the STA hardware encryption.) So confident that we finally found the root of the evil I tried to write some code catching the races, see the attachment. It's probably not the best fix, but the only one I could think of and deploy myself with the knowledge I gathered here and the last days. It's only for CCMP for now and I've not checked on assumptions what some parts of the code are for. This is just "works for me". (It survived now 30 rekeys under my test load, when previously nearly every rekey did freeze it.) I think there is no need any longer to generate captures, but if you want them anyway I can of course still work on that. The decision to check if the pn is <= 10 as indicator to switch over to the "correct" pn counter is at best questionable. (Using 1 does not work. I have problems with the inital key here and the code was not switching over to the correct rx_pn counter and I expect that we also may lose some frames here.) I'm not even sure that it's safe to assume that pn is always starting at one, since wpa_supplicant is dumping some sequence number on rekey I would have assumed to be a more or less random start value. But in the debug prints the real value is always starting at one. So I'm using that for now. I'll now try that in my environment, keeping the insane low rekey interval at two minutes for at least some weeks. What was really surprising me here is, that this is such a generic issue and I'm finding that in my home environment. For my understanding that should break many (all?) EAP Wlan's. (I'm using EAP-TLS and that did make the WLAN basically unusable and any sane person would have switched back to PSK...) For completeness, here is the original openwrt bug report I opened: https://dev.openwrt.org/ticket/18966 Alexander
diff -ur linux-4.0.4-gentoo/net/mac80211/key.h linux-4.0.4-gentoo_patched/net/mac80211/key.h --- linux-4.0.4-gentoo/net/mac80211/key.h 2015-04-13 00:12:50.000000000 +0200 +++ linux-4.0.4-gentoo_patched/net/mac80211/key.h 2015-05-20 13:16:06.370256697 +0200 @@ -84,12 +84,14 @@ * Management frames. */ u8 rx_pn[IEEE80211_NUM_TIDS + 1][IEEE80211_CCMP_PN_LEN]; + u8 rx_pn_old[IEEE80211_NUM_TIDS + 1][IEEE80211_CCMP_PN_LEN]; struct crypto_aead *tfm; u32 replays; /* dot11RSNAStatsCCMPReplays */ } ccmp; struct { atomic64_t tx_pn; u8 rx_pn[IEEE80211_CMAC_PN_LEN]; + u8 rx_pn_old[IEEE80211_CMAC_PN_LEN]; struct crypto_cipher *tfm; u32 replays; /* dot11RSNAStatsCMACReplays */ u32 icverrors; /* dot11RSNAStatsCMACICVErrors */ @@ -97,6 +99,7 @@ struct { atomic64_t tx_pn; u8 rx_pn[IEEE80211_GMAC_PN_LEN]; + u8 rx_pn_old[IEEE80211_GMAC_PN_LEN]; struct crypto_aead *tfm; u32 replays; /* dot11RSNAStatsCMACReplays */ u32 icverrors; /* dot11RSNAStatsCMACICVErrors */ @@ -109,12 +112,14 @@ * Management frames. */ u8 rx_pn[IEEE80211_NUM_TIDS + 1][IEEE80211_GCMP_PN_LEN]; + u8 rx_pn_old[IEEE80211_NUM_TIDS + 1][IEEE80211_GCMP_PN_LEN]; struct crypto_aead *tfm; u32 replays; /* dot11RSNAStatsGCMPReplays */ } gcmp; struct { /* generic cipher scheme */ u8 rx_pn[IEEE80211_NUM_TIDS + 1][MAX_PN_LEN]; + u8 rx_pn_old[IEEE80211_NUM_TIDS + 1][MAX_PN_LEN]; } gen; } u; diff -ur linux-4.0.4-gentoo/net/mac80211/wpa.c linux-4.0.4-gentoo_patched/net/mac80211/wpa.c --- linux-4.0.4-gentoo/net/mac80211/wpa.c 2015-04-13 00:12:50.000000000 +0200 +++ linux-4.0.4-gentoo_patched/net/mac80211/wpa.c 2015-05-20 21:43:25.529721066 +0200 @@ -495,6 +495,7 @@ struct sk_buff *skb = rx->skb; struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(skb); u8 pn[IEEE80211_CCMP_PN_LEN]; + static u8 zero[IEEE80211_CCMP_PN_LEN]; int data_len; int queue; @@ -525,6 +526,31 @@ return RX_DROP_UNUSABLE; } + /* HACK: try to work around race when replacing PSK with enabled hardware offload on AP or STA */ + if (unlikely(memcmp(key->u.ccmp.rx_pn[queue], zero, IEEE80211_CCMP_PN_LEN) == 0 )) { + + printk ("DDD - rx_pn is zero, virgin key detected! pl=%x\n", pn[IEEE80211_CCMP_PN_LEN-1]); + print_hex_dump_debug("cnt: ", DUMP_PREFIX_OFFSET, IEEE80211_CCMP_PN_LEN, 1, key->u.ccmp.rx_pn[queue], IEEE80211_CCMP_PN_LEN, false); + print_hex_dump_debug("pn : ", DUMP_PREFIX_OFFSET, IEEE80211_CCMP_PN_LEN, 1, pn, IEEE80211_CCMP_PN_LEN, false); + + if ((memcmp(pn, zero, IEEE80211_CCMP_PN_LEN -1) == 0) && pn[IEEE80211_CCMP_PN_LEN-1] <= 10) { + /* pn is <=10 , we can start using the new counter */ + printk ("DDD - set new pn\n"); + memcpy(key->u.ccmp.rx_pn[queue], pn, IEEE80211_CCMP_PN_LEN); + } else if (memcmp(pn, key->u.ccmp.rx_pn_old[queue], IEEE80211_CCMP_PN_LEN) <= 0) { + printk ("DDD - attack!\n"); + /* reply attack during rekey operation, guess it will really hard to do that... */ + key->u.ccmp.replays++; /* we count it as an reply anyway...*/ + return RX_DROP_UNUSABLE; + } else { + printk ("DDD - prevent poisening \n"); + memcpy(key->u.ccmp.rx_pn_old[queue], pn, IEEE80211_CCMP_PN_LEN); + } + } else { + + memcpy(key->u.ccmp.rx_pn[queue], pn, IEEE80211_CCMP_PN_LEN); + } + if (!(status->flag & RX_FLAG_DECRYPTED)) { u8 aad[2 * AES_BLOCK_SIZE]; u8 b_0[AES_BLOCK_SIZE]; @@ -539,8 +565,6 @@ return RX_DROP_UNUSABLE; } - memcpy(key->u.ccmp.rx_pn[queue], pn, IEEE80211_CCMP_PN_LEN); - /* Remove CCMP header and MIC */ if (pskb_trim(skb, skb->len - mic_len)) return RX_DROP_UNUSABLE;