Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

Doug Ledford <dledford@xxxxxxxxxx> · Fri, 11 Sep 2015 16:09:49 -0400

On 09/11/2015 02:39 PM, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2015 at 09:21:05PM -0400, Doug Ledford wrote:
>> During the recent rework of the mcast handling in ipoib, the join
>> task for regular and send-only joins were merged.  In the old code,
>> the comments indicated that the ipoib driver didn't send enough
>> information to auto-create IB multicast groups when the join was a
>> send-only join.  The reality is that the comments said we didn't, but
>> we actually did.  Since we merged the two join tasks, we now follow
>> the comments and don't auto-create IB multicast groups for an ipoib
>> send-only multicast join.  This has been reported to cause problems
>> in certain environments that rely on this behavior.  Specifically,
>> if you have an IB <-> Ethernet gateway then there is a fundamental
>> mismatch between the methodologies used on the two fabrics.  On
>> Ethernet, an app need not subscribe to a multicast group, merely
>> listen.
> 
> This should probably be clarified. On all IP networks IGMP/MLD is used
> to advertise listeners.
> 
> A IB/Eth gateway is a router, and IP routers are expected to process
> IGMP - so the gateway certainly can (and maybe must) be copying
> groups declared with IGMP from the eth side into listeners on IB MGIDs

Obviously, the gateway in question currently is not doing this.

> We may also be making a mistake in IPoIB - if the send side MGID join
> fails, the send should probably go to the broadcast, and the join
> retried from time to time. This would at least let the gateway
> optimize a bit more by only creating groups in active use.

That's not *too* far off from what we do.  This change could be made
without a huge amount of effort.  Right now we queue up the packet and
retry the join on a timer.  Only after all of the join attempts have
failed their timed retries do we dequeue and drop the packets.  We could
drop the queue backlog entirely and just send to broadcast when the
multicast group is unsubscribed.

> That would emulate the ethernet philosophy of degrade to broadcast.

Indeed.

> TBH, fixing broken gateway devices by sending to broadcast appeals to
> me more than making a module option..

Well, we've already established that the gateway device might be well be
broken.  That makes one wonder if this will work or if it might be
broken too.

>> on the IB side of the gateway.  There are instances of installations
>> with 100's (maybe 1000's) of multicast groups where static creation
>> of all the groups is not practical that rely upon the send-only
> 
> With my last patch the SM now has enough information to auto-create
> the wonky send-only join attempts, if wanted to. It just needs to fill
> in tclass, so it certainly is possible to address this long term
> without a kernel patch.
> 
>> joins creating the IB multicast group in order to function, so to
>> preserve these existing installations, add a module option to the
>> ipoib module to restore the previous behavior.
> 
> .. but an option to restore behavior of an older kernel version makes
> sense - did we really have a kernel that completely converted a
> send-only join to full join&create?

I just re-checked to verify this and here's what I found:

kernels v3.10, v3,19, and v4.0 have this in ipoib_mcast_sendonly_join:

        if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
                ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
                return -EBUSY;
        }

        rec.mgid     = mcast->mcmember.mgid;
        rec.port_gid = priv->local_gid;
        rec.pkey     = cpu_to_be16(priv->pkey);

        comp_mask =
                IB_SA_MCMEMBER_REC_MGID         |
                IB_SA_MCMEMBER_REC_PORT_GID     |
                IB_SA_MCMEMBER_REC_PKEY         |
                IB_SA_MCMEMBER_REC_JOIN_STATE;

        mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
                                         priv->port, &rec,
                                         comp_mask,
                                         GFP_ATOMIC,
                                         ipoib_mcast_sendonly_join_complete,
                                         mcast);

kernel v4.1 and on no longer have sendonly_join as a separate function.
 So, to answer your question, no, upstream didn't do this.  This,
however, is carried in OFED:

        rec.mgid     = mcast->mcmember.mgid;
        rec.port_gid = priv->local_gid;
        rec.pkey     = cpu_to_be16(priv->pkey);

        comp_mask =
                IB_SA_MCMEMBER_REC_MGID         |
                IB_SA_MCMEMBER_REC_PORT_GID     |
                IB_SA_MCMEMBER_REC_PKEY         |
                IB_SA_MCMEMBER_REC_JOIN_STATE;

        if (priv->broadcast) {
                comp_mask |=
                        IB_SA_MCMEMBER_REC_QKEY                 |
                        IB_SA_MCMEMBER_REC_MTU_SELECTOR         |
                        IB_SA_MCMEMBER_REC_MTU                  |
                        IB_SA_MCMEMBER_REC_TRAFFIC_CLASS        |
                        IB_SA_MCMEMBER_REC_RATE_SELECTOR        |
                        IB_SA_MCMEMBER_REC_RATE                 |
                        IB_SA_MCMEMBER_REC_SL                   |
                        IB_SA_MCMEMBER_REC_FLOW_LABEL           |
                        IB_SA_MCMEMBER_REC_HOP_LIMIT;

                rec.qkey          = priv->broadcast->mcmember.qkey;
                rec.mtu_selector  = IB_SA_EQ;
                rec.mtu           = priv->broadcast->mcmember.mtu;
                rec.traffic_class = priv->broadcast->mcmember.traffic_class;
                rec.rate_selector = IB_SA_EQ;
                rec.rate          = priv->broadcast->mcmember.rate;
                rec.sl            = priv->broadcast->mcmember.sl;
                rec.flow_label    = priv->broadcast->mcmember.flow_label;
                rec.hop_limit     = priv->broadcast->mcmember.hop_limit;
        }

and so this has been happening since forever in OFED (the above is from
1.5.4.1).

Part of the confusion here is that I was on a concall with the people
this problem is effecting, and their claim was that this started with my
cleanup of the multicast stuff in the v4.1 kernel.  I double checked the
OFED sources at the time, but not the upstream kernel, and took them at
their word that my patches caused the problem to appear on upstream
kernels, which implies that the older upstream kernels, where the
sendonly_join that I deleted still existed, were like OFED.  If my patch
set *did* start the problem, then it's not because of the autocreate
issue.  So in that sense, this would be more like creating an OFED
compatibility hook.

>> +		 * An additional problem is that if we auto-create the IB
>> +		 * mcast group in response to a send-only action, then we
>> +		 * will be the creating entity, but we will not have any
>> +		 * mechanism by which we will track when we should leave
>> +		 * the group ourselves.  We will occasionally leave and
>> +		 * re-join the group when these events occur:
> 
> I would drop the langauge of creating-entity, that isn't something
> from the IBA.
> 
> The uncontrolled lifetime of the join is not related to creating or
> not.
> 
>> +		 * 2) a regular mcast join/leave happens and we run
>> +		 *    ipoib_mcast_restart_task
> 
> Really? All send only mgids are discarded at that point? Ugly.

Yep.  Any time an app joins/leaves a multicast group, it updates our
multicast list in the net core, and via our net update we schedule a
multicast_restart_task().  When that happens, all full joins are matched
against the new list of joined groups from the net core, any that no
longer match a subscribed group we put on the remove list, any that
match we leave on the list, any that are now in the net core and not in
our list we add.  Because the sendonly groups are not tracked at the net
core level, our only option is to move them all to the remove list and
when we get another sendonly packet, rejoin.  Unless we want them to
stay around forever.  But since they aren't real send-only joins, they
are full joins where we simply ignore the incoming data, leaving them
around seems a bad idea.

-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: 0E572FDD

Attachment:
signature.asc

Description: OpenPGP digital signature