On 09/11/2015 02:39 PM, Jason Gunthorpe wrote: > On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote: >> During the recent rework of the mcast handling in ipoib, the join >> task for regular and send-only joins were merged. In the old code, >> the comments indicated that the ipoib driver didn't send enough >> information to auto-create IB multicast groups when the join was a >> send-only join. The reality is that the comments said we didn't, but >> we actually did. Since we merged the two join tasks, we now follow >> the comments and don't auto-create IB multicast groups for an ipoib >> send-only multicast join. This has been reported to cause problems >> in certain environments that rely on this behavior. Specifically, >> if you have an IB <-> Ethernet gateway then there is a fundamental >> mismatch between the methodologies used on the two fabrics. On >> Ethernet, an app need not subscribe to a multicast group, merely >> listen. > > This should probably be clarified. On all IP networks IGMP/MLD is used > to advertise listeners. > > A IB/Eth gateway is a router, and IP routers are expected to process > IGMP - so the gateway certainly can (and maybe must) be copying > groups declared with IGMP from the eth side into listeners on IB MGIDs Obviously, the gateway in question currently is not doing this. > We may also be making a mistake in IPoIB - if the send side MGID join > fails, the send should probably go to the broadcast, and the join > retried from time to time. This would at least let the gateway > optimize a bit more by only creating groups in active use. That's not *too* far off from what we do. This change could be made without a huge amount of effort. Right now we queue up the packet and retry the join on a timer. Only after all of the join attempts have failed their timed retries do we dequeue and drop the packets. We could drop the queue backlog entirely and just send to broadcast when the multicast group is unsubscribed. > That would emulate the ethernet philosophy of degrade to broadcast. Indeed. > TBH, fixing broken gateway devices by sending to broadcast appeals to > me more than making a module option.. Well, we've already established that the gateway device might be well be broken. That makes one wonder if this will work or if it might be broken too. >> on the IB side of the gateway. There are instances of installations >> with 100's (maybe 1000's) of multicast groups where static creation >> of all the groups is not practical that rely upon the send-only > > With my last patch the SM now has enough information to auto-create > the wonky send-only join attempts, if wanted to. It just needs to fill > in tclass, so it certainly is possible to address this long term > without a kernel patch. > >> joins creating the IB multicast group in order to function, so to >> preserve these existing installations, add a module option to the >> ipoib module to restore the previous behavior. > > .. but an option to restore behavior of an older kernel version makes > sense - did we really have a kernel that completely converted a > send-only join to full join&create? I just re-checked to verify this and here's what I found: kernels v3.10, v3,19, and v4.0 have this in ipoib_mcast_sendonly_join: if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) { ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n"); return -EBUSY; } rec.mgid = mcast->mcmember.mgid; rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); comp_mask = IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE; mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port, &rec, comp_mask, GFP_ATOMIC, ipoib_mcast_sendonly_join_complete, mcast); kernel v4.1 and on no longer have sendonly_join as a separate function. So, to answer your question, no, upstream didn't do this. This, however, is carried in OFED: rec.mgid = mcast->mcmember.mgid; rec.port_gid = priv->local_gid; rec.pkey = cpu_to_be16(priv->pkey); comp_mask = IB_SA_MCMEMBER_REC_MGID | IB_SA_MCMEMBER_REC_PORT_GID | IB_SA_MCMEMBER_REC_PKEY | IB_SA_MCMEMBER_REC_JOIN_STATE; if (priv->broadcast) { comp_mask |= IB_SA_MCMEMBER_REC_QKEY | IB_SA_MCMEMBER_REC_MTU_SELECTOR | IB_SA_MCMEMBER_REC_MTU | IB_SA_MCMEMBER_REC_TRAFFIC_CLASS | IB_SA_MCMEMBER_REC_RATE_SELECTOR | IB_SA_MCMEMBER_REC_RATE | IB_SA_MCMEMBER_REC_SL | IB_SA_MCMEMBER_REC_FLOW_LABEL | IB_SA_MCMEMBER_REC_HOP_LIMIT; rec.qkey = priv->broadcast->mcmember.qkey; rec.mtu_selector = IB_SA_EQ; rec.mtu = priv->broadcast->mcmember.mtu; rec.traffic_class = priv->broadcast->mcmember.traffic_class; rec.rate_selector = IB_SA_EQ; rec.rate = priv->broadcast->mcmember.rate; rec.sl = priv->broadcast->mcmember.sl; rec.flow_label = priv->broadcast->mcmember.flow_label; rec.hop_limit = priv->broadcast->mcmember.hop_limit; } and so this has been happening since forever in OFED (the above is from 1.5.4.1). Part of the confusion here is that I was on a concall with the people this problem is effecting, and their claim was that this started with my cleanup of the multicast stuff in the v4.1 kernel. I double checked the OFED sources at the time, but not the upstream kernel, and took them at their word that my patches caused the problem to appear on upstream kernels, which implies that the older upstream kernels, where the sendonly_join that I deleted still existed, were like OFED. If my patch set *did* start the problem, then it's not because of the autocreate issue. So in that sense, this would be more like creating an OFED compatibility hook. >> + * An additional problem is that if we auto-create the IB >> + * mcast group in response to a send-only action, then we >> + * will be the creating entity, but we will not have any >> + * mechanism by which we will track when we should leave >> + * the group ourselves. We will occasionally leave and >> + * re-join the group when these events occur: > > I would drop the langauge of creating-entity, that isn't something > from the IBA. > > The uncontrolled lifetime of the join is not related to creating or > not. > >> + * 2) a regular mcast join/leave happens and we run >> + * ipoib_mcast_restart_task > > Really? All send only mgids are discarded at that point? Ugly. Yep. Any time an app joins/leaves a multicast group, it updates our multicast list in the net core, and via our net update we schedule a multicast_restart_task(). When that happens, all full joins are matched against the new list of joined groups from the net core, any that no longer match a subscribed group we put on the remove list, any that match we leave on the list, any that are now in the net core and not in our list we add. Because the sendonly groups are not tracked at the net core level, our only option is to move them all to the remove list and when we get another sendonly packet, rejoin. Unless we want them to stay around forever. But since they aren't real send-only joins, they are full joins where we simply ignore the incoming data, leaving them around seems a bad idea. -- Doug Ledford <dledford@xxxxxxxxxx> GPG KeyID: 0E572FDD
Attachment:
signature.asc
Description: OpenPGP digital signature