On 17/08/17 14:39, Nikolay Aleksandrov wrote: > On 17/08/17 14:03, David Lamparter wrote: >> Hi Nikolay, >> >> >> On Wed, Aug 16, 2017 at 11:38:06PM +0300, Nikolay Aleksandrov wrote: >>> On 16/08/17 20:01, David Lamparter wrote: >>>> This implements holding dst metadata information in the bridge layer, >>>> but only for unicast entries in the MAC table. Multicast is still left >>>> to design and implement. >>> >>> Sorry but I do not agree with this change, adding a special case for >>> VPLS in the bridge code >> >> I don't think this is specific to VPLS at all, though you're right that >> VPLS is the only user currently. >> >>> and hitting the fast path for everyone in a few different places for a >>> feature that the majority will not use does not sound acceptable to >>> me. We've been trying hard to optimize it, trying to avoid additional >>> cache lines, removing tests and keeping special cases to a minimum. >> >> skb->dst is on the same cacheline as skb->len. >> fdb->md_dst is on the same cacheline as fdb->dst. >> Both will be 0 in a lot of cases, so this should be two null checks on >> data that is hot in the cache. Are you sure this is an actual problem? >> > > Sure - no, I haven't benchmarked it, but I don't see skb->len being on > the same cache line as _skb_refdst assuming 64 byte cache lines. I should've been clearer - that obviously depends on the kernel config, but in order for them to be in the same line you need to disable either one of conntrack, bridge_netfilter or xfrm, these are almost always enabled (at least in all major distributions). > But again any special cases, in my opinion, should be handled on their own, > it is both about the fast path and the code complexity that they bring in. > >>> I understand that you want to use the fdb tables and avoid >>> duplication, but this is not worth it. There're other similar use >>> cases and they have their own private fdb tables, that way the user >>> can opt out and is much cleaner and separated. >> >> Sure, this can be done. I think it's a noticeable performance penalty >> to have the entire fdb copied (multiple times for H-VPLS even), but I >> understand that it's preferable to have the normal cases faster in >> exchange. As the previous paragraph notes, I still wonder if that hit >> to the normal case exists though. >> >> I will leave this to Amine, he's paid to work on VPLS while I'm doing >> this for fun ;) >> >> There is however another concern I have here. As I noted in my >> introductory mail, I'm working on the bridge MDB making similar changes. >> And there's actually strong use cases for this in both VPLS and the >> 802.11 code (though I'm not sure I can code the latter one up, it's >> related to rate control and this is seriously complicated - the goal is >> to select a multicast rate based on the now-known receiving STAs). >> >> I really hope you're not suggesting the entire MDB with IPv4 & IPv6 >> snooping be duplicated into both VPLS and mac80211? >> > > Code can always be shared if there are more users, no need to stuff everything in > the bridge, but I'm not that familiar with this case, once patches are out I can > comment further. > >>> As you've noted this is only an RFC so I will not point out every issue, but there seems >>> to be a major problem with br_fdb_update(), note that it runs without any locks except RCU. >> >> Right, Thanks! ... I only thought about concurrent access, forgetting >> about concurrent modification... I'll replace it with an xchg I think. >> (No need for a lock that way) > > I think you can still lose references to a dst that way, what if someone changes the > dst you read before the xchg and you xchg it ? > >> >> That said, now that I think about it, the skb_dst_set_noref() in the >> following chunk is probably not safe if the VPLS device has a qdisc on >> it. I was relying on the fact that br_dev_xmit is holding RCU there, >> but if the SKB is queued, md_dst might go away in the meantime... >> ... probably need to change this to dst_hold() + skb_dst_set()... >> >> >> -David >> >>>> diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c >>>> index 861ae2a165f4..534cacf02f8d 100644 >>>> --- a/net/bridge/br_device.c >>>> +++ b/net/bridge/br_device.c >>>> @@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev) >>>> else >>>> br_flood(br, skb, BR_PKT_MULTICAST, false, true); >>>> } else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) { >>>> + struct dst_entry *md_dst = rcu_dereference(dst->md_dst); >>>> + if (md_dst) >>>> + skb_dst_set_noref(skb, md_dst); >>>> br_forward(dst->dst, skb, false, true); >>>> } else { >>>> br_flood(br, skb, BR_PKT_UNICAST, false, true); >> >> >>>> diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c >>>> index a79b648aac88..0751fcb89699 100644 >>>> @@ -567,10 +579,15 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source, >>>> source->dev->name, addr, vid); >>>> } else { >>>> unsigned long now = jiffies; >>>> + struct dst_entry *ref_md = rcu_access_pointer(fdb->md_dst); >>>> >>>> /* fastpath: update of existing entry */ >>>> - if (unlikely(source != fdb->dst)) { >>>> + if (unlikely(source != fdb->dst || >>>> + dst_metadata_cmp(md_dst, ref_md))) { >>>> fdb->dst = source; >>>> + dst_release(ref_md); >>>> + rcu_assign_pointer(fdb->md_dst, >>>> + dst_clone(md_dst)); >>>> fdb_modified = true; >>>> /* Take over HW learned entry */ >>>> if (unlikely(fdb->added_by_external_learn)) >