On 2/9/2021 2:15 PM, Jason Gunthorpe wrote:
On Thu, Jan 28, 2021 at 06:46:47PM +0000, Christoph Lameter wrote:
From 64e734c38f509d591073fc1e1db3caa42be3b874 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <cl@xxxxxxxxx>
Date: Thu, 28 Jan 2021 14:55:36 +0000
Subject: [PATCH] Fix: Remove racy Subnet Manager sendonly join checks
When a system receives a REREG event from the SM, the SM information in
the kernel is marked as invalid and a request is sent to the SM to update
it. The SM information stays invalid until the SM's reply arrives.
However, the REREG event is delivered to user space applications at the
same time, and they immediately try to rejoin their multicast groups. Some
of those may be sendonly multicast groups, and those joins then fail.
If the SM information is invalid then ib_sa_sendonly_fullmem_support()
returns false. That is wrong because it just means that we do not know
yet if the potentially new SM supports sendonly joins.
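As an illustration only, the race described above can be modelled in a few
lines of user-space C. This is a simplified sketch, not the kernel code:
struct sm_info, sendonly_support_check() and the pick_join_state_*() helpers
are hypothetical names invented for this example, and the fallback to a full
join is the behaviour described further down in this commit message.

```c
#include <stdbool.h>

/* Hypothetical join states; the kernel uses its own encoding. */
enum join_state { JOIN_FULL_MEMBER, JOIN_SENDONLY_FULL_MEMBER };

/* Simplified stand-in for the SM information cached per port. */
struct sm_info {
	bool valid;            /* false between a REREG and the SM's reply */
	bool sendonly_capable; /* capability seen in the last valid reply */
};

/* Model of the removed check: "not known yet" is reported as "no". */
static bool sendonly_support_check(const struct sm_info *sm)
{
	return sm->valid && sm->sendonly_capable;
}

/* Before the patch: a failed check downgrades the request to a full join. */
static enum join_state pick_join_state_old(const struct sm_info *sm,
					   enum join_state wanted)
{
	if (wanted == JOIN_SENDONLY_FULL_MEMBER && !sendonly_support_check(sm))
		return JOIN_FULL_MEMBER;
	return wanted;
}

/* After the patch: the requested join state is used unconditionally. */
static enum join_state pick_join_state_new(const struct sm_info *sm,
					   enum join_state wanted)
{
	(void)sm; /* cached SM state no longer influences the join type */
	return wanted;
}
```

With valid set to false, as it is right after a REREG, the old helper
returns JOIN_FULL_MEMBER even though the SM is capable, while the new helper
keeps the sendonly request.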
Sendonly join was introduced in 2015 and all subnet managers have
supported it ever since. So there is no point in checking if a subnet
manager supports it.
Should an old opensm get a request for a sendonly join then the request
will fail. The code that is removed here accommodated that situation
and fell back to a full join.
Falling back to a full join is problematic in itself. The reason to
use the sendonly join was to reduce the traffic on the InfiniBand
fabric; otherwise one could have just stayed with the regular join.
So this patch may cause users of very old opensms to discover that
lots of traffic needlessly crosses their IB fabrics.
Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>
---
drivers/infiniband/core/cma.c | 11 ---------
drivers/infiniband/core/sa_query.c | 24 -------------------
drivers/infiniband/ulp/ipoib/ipoib.h | 1 -
drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 --
.../infiniband/ulp/ipoib/ipoib_multicast.c | 13 +---------
5 files changed, 1 insertion(+), 50 deletions(-)
This one got spam filtered and didn't make it to the list:
Received-SPF: SoftFail (hqemgatev14.nvidia.com: domain of
cl@xxxxxxxxx is inclined to not designate 3.19.106.255 as
permitted sender) identity=mailfrom; client-ip=3.19.106.255;
receiver=hqemgatev14.nvidia.com;
envelope-from="cl@xxxxxxxxx"; x-sender="cl@xxxxxxxxx";
x-conformance=spf_only; x-record-type="v=spf1"
Also, the extra From/Date/Subject ended up in the commit message.
I fixed it all up and applied it to for-next.
It looks like OPA will also suffer this race (opa_pr_query_possible),
maybe it is a little less likely since it will be driven by PR queries
not broadcast joins.
But the same logic is likely true there, I'd be surprised if OPA
fabrics are not running a capable OPA SM at this point.
OPA supports SENDONLY joins.
-Denny