Hi Andrew,

Thanks for the review. Comments and fix links are inline below.

On Sun, Oct 21, 2018 at 12:47 AM Andrew Lunn <andrew@xxxxxxx> wrote:
> > +#define choose_node(parent, key) \
> > +	parent->bit[(key[parent->bit_at_a] >> parent->bit_at_b) & 1]
>
> This should be a function, not a macro.

That prevents it from being used as an lvalue, which is mostly how it's
used, with the RCU-protected structure, so this will remain a macro.
I'll make it uppercase, though.

> > +static void node_free_rcu(struct rcu_head *rcu)
> > +{
> > +	kfree(container_of(rcu, struct allowedips_node, rcu));
> > +}
> > +
> > +#define push_rcu(stack, p, len) ({ \
> > +		if (rcu_access_pointer(p)) { \
> > +			WARN_ON(IS_ENABLED(DEBUG) && (len) >= 128); \
> > +			stack[(len)++] = rcu_dereference_raw(p); \
> > +		} \
> > +		true; \
> > +	})
>
> This also looks like it could be a function.

You're right. I've changed this now. That produces slightly worse code
on gcc 8.2, but it's not really a hot path anyway, so it doesn't
matter.

> > +static void root_free_rcu(struct rcu_head *rcu)
> > +{
> > +	struct allowedips_node *node, *stack[128] = {
> > +		container_of(rcu, struct allowedips_node, rcu) };
> > +	unsigned int len = 1;
> > +
> > +	while (len > 0 && (node = stack[--len]) &&
> > +	       push_rcu(stack, node->bit[0], len) &&
> > +	       push_rcu(stack, node->bit[1], len))
> > +		kfree(node);
> > +}
> > +
> > +#define ref(p) rcu_access_pointer(p)
> > +#define deref(p) rcu_dereference_protected(*(p), lockdep_is_held(lock))
>
> Macros should be uppercase, or better still, functions.
>
> > +#define push(p) ({ \
> > +		WARN_ON(IS_ENABLED(DEBUG) && len >= 128); \
> > +		stack[len++] = p; \
> > +	})
>
> This one definitely should be upper case, to warn readers it has
> unexpected side effects.

I've made these little helper macros uppercase.

> > +static void walk_remove_by_peer(struct allowedips_node __rcu **top,
> > +				struct wg_peer *peer, struct mutex *lock)
> > +{
> > +	struct allowedips_node __rcu **stack[128], **nptr;
> > +	struct allowedips_node *node, *prev;
> > +	unsigned int len;
> > +
> > +	if (unlikely(!peer || !ref(*top)))
> > +		return;
> > +
> > +	for (prev = NULL, len = 0, push(top); len > 0; prev = node) {
> > +		nptr = stack[len - 1];
> > +		node = deref(nptr);
> > +		if (!node) {
> > +			--len;
> > +			continue;
> > +		}
> > +		if (!prev || ref(prev->bit[0]) == node ||
> > +		    ref(prev->bit[1]) == node) {
> > +			if (ref(node->bit[0]))
> > +				push(&node->bit[0]);
> > +			else if (ref(node->bit[1]))
> > +				push(&node->bit[1]);
> > +		} else if (ref(node->bit[0]) == prev) {
> > +			if (ref(node->bit[1]))
> > +				push(&node->bit[1]);
> > +		} else {
> > +			if (rcu_dereference_protected(node->peer,
> > +				lockdep_is_held(lock)) == peer) {
> > +				RCU_INIT_POINTER(node->peer, NULL);
> > +				if (!node->bit[0] || !node->bit[1]) {
> > +					rcu_assign_pointer(*nptr,
> > +						deref(&node->bit[!ref(node->bit[0])]));
> > +					call_rcu_bh(&node->rcu, node_free_rcu);
> > +					node = deref(nptr);
> > +				}
> > +			}
> > +			--len;
> > +		}
> > +	}
> > +}
> > +
> > +#undef ref
> > +#undef deref
> > +#undef push
> > +
> > +static __always_inline unsigned int fls128(u64 a, u64 b)
> > +{
> > +	return a ? fls64(a) + 64U : fls64(b);
> > +}
>
> Does the compiler actually get this wrong? Not inline it?

Actually, for this function (in contrast with all of the other ones
marked as such, which really must have the annotation), recent gcc gets
it right, so I've removed the annotation. Nice catch.
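In case it's useful to other reviewers, here's a quick userspace sanity
check of the fls128() semantics, with my_fls64() standing in for the
kernel's fls64() (1-based index of the highest set bit, or 0 for 0).
This is just an illustrative sketch, not part of the patch:

/* fls128(a, b) treats a as the high 64 bits and b as the low 64 bits
 * of a 128-bit value, and returns the 1-based index of the highest
 * set bit, or 0 if no bit is set at all.
 */
#include <assert.h>
#include <stdint.h>

static unsigned int my_fls64(uint64_t x)
{
	return x ? 64U - (unsigned int)__builtin_clzll(x) : 0U;
}

static unsigned int fls128(uint64_t a, uint64_t b)
{
	return a ? my_fls64(a) + 64U : my_fls64(b);
}

int main(void)
{
	assert(fls128(0, 0) == 0);
	assert(fls128(0, 1) == 1);            /* lowest bit of the low half */
	assert(fls128(0, 1ULL << 63) == 64);  /* highest bit of the low half */
	assert(fls128(1, 0) == 65);           /* lowest bit of the high half */
	assert(fls128(1ULL << 63, 0) == 128); /* highest bit overall */
	return 0;
}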
> > +static __always_inline u8 common_bits(const struct allowedips_node *node,
> > +				       const u8 *key, u8 bits)
> > +{
> > +	if (bits == 32)
> > +		return 32U - fls(*(const u32 *)node->bits ^ *(const u32 *)key);
> > +	else if (bits == 128)
> > +		return 128U - fls128(
> > +			*(const u64 *)&node->bits[0] ^ *(const u64 *)&key[0],
> > +			*(const u64 *)&node->bits[8] ^ *(const u64 *)&key[8]);
> > +	return 0;
> > +}
> > +
> > +/* This could be much faster if it actually just compared the common bits
> > + * properly, by precomputing a mask bswap(~0 << (32 - cidr)), and the rest, but
> > + * it turns out that common_bits is already super fast on modern processors,
> > + * even taking into account the unfortunate bswap. So, we just inline it like
> > + * this instead.
> > + */
> > +#define prefix_matches(node, key, bits) \
> > +	(common_bits(node, key, bits) >= (node)->cidr)
>
> Could be a function.

I've made this a function now and confirmed that codegen did not change
significantly.

> > +/* Returns a strong reference to a peer */
> > +static __always_inline struct wg_peer *
> > +lookup(struct allowedips_node __rcu *root, u8 bits, const void *be_ip)
> > +{
> > +	u8 ip[16] __aligned(__alignof(u64));
>
> You virtually never see aligned stack variables. This needs some sort
> of comment.

I've added a comment. The reason, if you're curious, is so that we can
pass it to fls64 on all platforms.

> > +__attribute__((nonnull(1))) static bool
> > +node_placement(struct allowedips_node __rcu *trie, const u8 *key, u8 cidr,
> > +	       u8 bits, struct allowedips_node **rnode, struct mutex *lock)
> > +{
> > +	struct allowedips_node *node = rcu_dereference_protected(trie,
> > +						lockdep_is_held(lock));
> > +	struct allowedips_node *parent = NULL;
> > +	bool exact = false;
>
> Should there be a WARN_ON(!key) here, since the attribute will only
> detect problems at compile time, and maybe some runtime cases will get
> past it?

Actually, no: it's only used in one place, and that caller is pretty
clear about checking beforehand, and even this function checks
implicitly. Looking at my commit log, I had added the annotation to
make clang-analyzer happy, because it couldn't (at the time) deduce
things correctly from rcu_access_pointer(p) checks. It looks like
recent clang-analyzer has fixed that, and besides, I'm not sure the
kernel is in the business of adding annotations to work around the
static-analyzer-bug-of-the-week. So I've gotten rid of the annotation.
Thanks for bringing my attention to it.

> > +	net_dbg_ratelimited("%s: Could not decrypt invalid cookie response\n",
> > +			    wg->dev->name);
>
> It might be worth adding a netdev_dbg_ratelimited(), which takes a
> netdev as its first parameter, just like netdev_dbg().

That sounds like it could be useful. Though, I'm trying hard to make
the first patch submission _not_ need to touch any of the rest of the
networking stack. I've actually got a small list of interesting
networking stack changes that might be useful for WireGuard, but I
think I'd prefer to submit those after this is all merged, and each one
separately for a separate discussion, if that's okay with you.
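For concreteness, the sort of helper I'd imagine proposing later might
look something like the following. This is only a hedged sketch modeled
on the existing net_dbg_ratelimited(); the name and the idea of putting
it in include/linux/netdevice.h are my assumptions, not anything in
this series:

#include <linux/netdevice.h>
#include <linux/ratelimit.h>

/* Hypothetical: a rate-limited variant of netdev_dbg(), in the spirit
 * of net_dbg_ratelimited(), taking the netdev as its first parameter.
 */
#define netdev_dbg_ratelimited(dev, fmt, ...)				\
do {									\
	static DEFINE_RATELIMIT_STATE(_rs,				\
				      DEFAULT_RATELIMIT_INTERVAL,	\
				      DEFAULT_RATELIMIT_BURST);		\
									\
	if (__ratelimit(&_rs))						\
		netdev_dbg((dev), fmt, ##__VA_ARGS__);			\
} while (0)

The cookie message above would then become something like
netdev_dbg_ratelimited(wg->dev, "Could not decrypt invalid cookie
response\n").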
> > +#if defined(CONFIG_PM_SLEEP) && !defined(CONFIG_ANDROID)
>
> I don't see any other code which uses this combination. Why is this
> needed?

WireGuard clears private key material before going to sleep, so that
ephemeral keys never live longer in RAM than their expiration date.
This works well for nearly everything, except Android devices, where
crazy out-of-tree wireless drivers (such as qcacld) will actually wake
and sleep the phone constantly, for a few milliseconds each wake, to
handle incoming packets. In that case, we must not zero out ephemeral
keys, so that those packets can actually be decrypted. There are no
other systems that have this characteristic, as far as I can tell, and
I'm certainly not willing to add a runtime configuration knob for it,
and it turns out that matching on CONFIG_ANDROID does the right thing
all of the time. If the landscape becomes more complicated, we can
revisit this, probably in the form of passing reasons through to
pm_notification and related scaffolding. But in lieu of that
infrastructure someday existing, I prefer the
simple-and-always-works-anyway solution.

> > +	while (skb_queue_len(&peer->staged_packet_queue) > MAX_STAGED_PACKETS)
> > +		dev_kfree_skb(__skb_dequeue(&peer->staged_packet_queue));
>
> It would be good to have some stats counters in here. Maybe the
> standard interface statistics will cover it, otherwise ethtool -S.

Good point, I need to increment the drop counters there. I'll add that;
good catch. Willy and I actually discussed, around a year ago, adding a
series of WireGuard-specific error counters for all the various unusual
things that can happen. That's something I'll probably investigate in
earnest post-merge, as it's a bit of a bells-and-whistles thing, and
the existing counters are already sufficient for most purposes.
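Concretely, the shape of the fix I have in mind is roughly the
following. Treat it as a sketch only: that dev->stats.tx_dropped is the
right counter for this spot, and that `dev` (the wg net_device) is in
scope here, are assumptions that may change in the next revision:

/* Count each staged packet we throw away as a drop, instead of
 * freeing it silently.
 */
while (skb_queue_len(&peer->staged_packet_queue) > MAX_STAGED_PACKETS) {
	dev_kfree_skb(__skb_dequeue(&peer->staged_packet_queue));
	++dev->stats.tx_dropped;
}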
> You should also get this code looked at by some of the queueing
> people. Rather than discarding, you might want to be applying back
> pressure to slow down the sender application?

Actually Toke, DaveT, and I have agonized quite a bit over the queueing
mechanisms in WireGuard, as well as spending two GSoC summers looking
at them with students. There's been lots of flent and rrul and tweaking
involved, and the entire queueing system has gone through probably 15
different revisions of trying things out and experimenting. At this
point, I think the way we're doing queueing, and when we're dropping,
and those various factors, are taken care of pretty well. Future work
still involves considering shoehorning in some fq_codel in a similar
way that wireless does, and a few other tweaks, but I think the current
system is good and stable and very much sufficient.

The particular instance you're pointing out, though, with
staged_packet_queue, is not the relevant queue and isn't what you have
in mind with regard to backpressure. It is simply a very short-lived
thing for dealing with packets before a handshake is made, and for
handling xmit being used from multiple cores, but it's not _the_
queueing system that's relevant in the actual buffering and latency
calculations.

A little bit of the queueing stuff is discussed in the paper I
presented at Netdev 2.2, if you're interested:
https://www.wireguard.com/papers/wireguard-netdev22.pdf
Some of that is outdated by now, but it should give you some idea.

> > +static void kdf(u8 *first_dst, u8 *second_dst, u8 *third_dst, const u8 *data,
> > +		size_t first_len, size_t second_len, size_t third_len,
> > +		size_t data_len, const u8 chaining_key[NOISE_HASH_LEN])
> > +{
> > +	u8 output[BLAKE2S_HASH_SIZE + 1];
> > +	u8 secret[BLAKE2S_HASH_SIZE];
> > +
> > +	WARN_ON(IS_ENABLED(DEBUG) &&
> > +		(first_len > BLAKE2S_HASH_SIZE ||
> > +		 second_len > BLAKE2S_HASH_SIZE ||
> > +		 third_len > BLAKE2S_HASH_SIZE ||
> > +		 ((second_len || second_dst || third_len || third_dst) &&
> > +		  (!first_len || !first_dst)) ||
> > +		 ((third_len || third_dst) && (!second_len || !second_dst))));
>
> Maybe split this up into a number of WARN_ON()s. At the moment, if it
> fires, you have little idea why, there are so many comparisons. Also,
> is this on the hot path? I guess not, since this is keys, not
> packets. Do you need to care about DEBUG here?

This is on the hot path, actually. Well, it's not on the path of data
packets, but I do consider handshake packets to be fairly "warm". I'm
not especially keen on splitting that up into multiple WARN_ON()s,
mostly because if it is ever hit, something is seriously wrong, and the
whole flow would likely then need auditing anyway.

Regards,
Jason