Re: [PATCH net] net/smc: Fix lookup of netdev by using ib_device_get_netdev()

Tom Talpey <tom@xxxxxxxxxx> · Wed, 8 Jan 2025 12:27:15 -0500

On 1/8/2025 4:31 AM, Leon Romanovsky wrote:
On Tue, Jan 07, 2025 at 10:51:19PM +0000, Kangjing Huang wrote:
On Thu, Dec 19, 2024 at 4:56 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:

On Sat, Dec 14, 2024 at 08:02:14AM +0000, Kangjing Huang wrote:
On Sat, Dec 14, 2024 at 1:06 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:

On Sat, Dec 14, 2024, at 04:33, Namjae Jeon wrote:
On Fri, Dec 13, 2024 at 8:07 PM Kangjing Huang <huangkangjing@xxxxxxxxx> wrote:

Hi there,

I am the original author of commit ecce70cf17d9 ("ksmbd: fix missing
RDMA-capable flag for IPoIB device in ksmbd_rdma_capable_netdev()"),
as mentioned in the thread.

I am working on modifying the patch to take care of the layering
violation. The original patch was meant to fix an issue with ksmbd,
where an IPoIB netdev was not recognized as RDMA-capable. The original
version of the capability evaluation tries to match each netdev to
ib_device by calling get_netdev in ib verbs. However, this only works
in cases where the ib_device is the upper layer of netdev (e.g. RoCE),
and since with IPoIB it is the other way around (netdev is the upper
layer of ib_device), get_netdev won't work anymore.
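
The asymmetry described above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: mock_netdev, mock_ibdev and
forward_lookup_matches() are hypothetical stand-ins for net_device,
ib_device and the get_netdev-based matching. It assumes an RDMA device on
an IB port reports no underlying netdev.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct net_device. */
struct mock_netdev {
	const char *name;
};

/* Hypothetical stand-in for struct ib_device with a single port. */
struct mock_ibdev {
	const char *name;
	/* Netdev the RDMA device sits on top of, or NULL if there is none
	 * (assumed here for an IB port, where IPoIB stacks the other way). */
	struct mock_netdev *port_netdev;
};

/* Forward lookup: ask every RDMA device for its netdev and compare. */
bool forward_lookup_matches(const struct mock_ibdev *ibdevs, size_t n,
			    const struct mock_netdev *netdev)
{
	for (size_t i = 0; i < n; i++) {
		if (ibdevs[i].port_netdev && ibdevs[i].port_netdev == netdev)
			return true;
	}
	return false;
}
```

In this model a RoCE-style device that points at eth0 is found, while an
IPoIB netdev such as ib0 never matches because no RDMA device points back
at it.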

I tried to do the device matching in reverse in the original version
of my patch using GID, which ended up as the layering violation.
However, I am unaware of any exported functions from the IPoIB driver
that could do the reverse lookup from netdev to the lower layer
ib_device. Actually, it seems that the IPoIB driver does not have
any exported symbols at all.

It might be that the device matching in reverse just does not make any
sense and does not need to be done at all. As long as it is an IPoIB
device (netdev->type == ARPHRD_INFINIBAND) it might be ok to just
automatically assume it is RDMA-capable. I am not 100% sure about this
though.
Why can't we assume RDMA-capable if it's ARPHRD_INFINIBAND type?
How about assuming it's RDMA-capable and allowing users to turn
RDMA-capable on/off via sysfs?
It does make more sense to me at this point to just broadly assume all
ARPHRD_INFINIBAND types to be RDMA-capable. We just need to make sure
this assumption indeed holds and figure out to what extent this could
involve the same layering violation.

Any attempt to treat ipoib differently from regular netdevice is wrong by definition.

I would agree that the design direction to treat ipoib as a pure
regular net_device is the good way to go. But the problem with ksmbd
and ipoib devices stems from the SMB protocol itself.

In contrast to protocols that focus on certain functionalities like
nfs, SMB actually tries to manage network interfaces actively in the
protocol itself: SMB protocol's RDMA support (dubbed SMB Direct) is a
sub-feature of SMB Multichannel. Multichannel is designed to let
client and server find multiple data paths automatically (imagine a
pair of hosts with multiple adapters connected by multiple cables) to
increase bandwidth. So a client can initiate a
FSCTL_QUERY_NETWORK_INTERFACE_INFO request and the server is expected
to respond with NETWORK_INTERFACE_INFO containing _all_ local network
interface information, including capabilities such as
RDMA_CAPABLE (for details see ref [MS-SMB2] 3.3.5.15.11). Only upon
seeing the capability flag would a client attempt to initiate an RDMA
connection.

Reference: [MS-SMB2](https://winprotocoldoc.z19.web.core.windows.net/MS-SMB2/%5bMS-SMB2%5d.pdf)
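
To make the exchange above concrete, here is a minimal userspace sketch of
how a server might build the interface list and how a client might act on
it. It is not the ksmbd implementation. struct iface_info is a simplified,
host-endian subset of NETWORK_INTERFACE_INFO (the socket address and chaining
fields are left out), and local_iface, build_iface_list() and
client_should_try_rdma() are hypothetical names. The capability bit values
are the ones I believe [MS-SMB2] defines and should be checked against the
spec.

```c
#include <stddef.h>
#include <stdint.h>

/* Capability bits, assumed to match [MS-SMB2] NETWORK_INTERFACE_INFO. */
#define RSS_CAPABLE  0x00000001u
#define RDMA_CAPABLE 0x00000002u

/* Simplified entry: a subset of the fields, in host byte order. */
struct iface_info {
	uint32_t if_index;
	uint32_t capability;
	uint64_t link_speed;	/* bits per second */
};

/* Hypothetical server-side view of one local interface. */
struct local_iface {
	uint32_t if_index;
	uint64_t link_speed;
	int rdma_capable;	/* result of whatever capability check is used */
};

/* Server side: report every interface, flag only the RDMA-capable ones. */
size_t build_iface_list(const struct local_iface *in, size_t n,
			struct iface_info *out)
{
	for (size_t i = 0; i < n; i++) {
		out[i].if_index = in[i].if_index;
		out[i].link_speed = in[i].link_speed;
		out[i].capability = in[i].rdma_capable ? RDMA_CAPABLE : 0;
	}
	return n;
}

/* Client side: attempt SMB Direct only when the flag is set. */
int client_should_try_rdma(const struct iface_info *entry)
{
	return (entry->capability & RDMA_CAPABLE) != 0;
}
```

The point of the sketch is that the server has to answer for every
interface, so whatever decides rdma_capable is on the critical path for
whether a client ever tries RDMA on that link.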

TLDR is that the SMB protocol requires the server to enumerate all
net_devices and indicate their RDMA capability, and
ksmbd_rdma_capable_netdev() is only used in that process. Given such
context, I wonder what should be the best way to approach this? Is
using ARPHRD_INFINIBAND good enough and acceptable in terms of
layering?

The thing is that ARPHRD_INFINIBAND indeed represents IPoIB, and it is
the right check for whether a netdev is IPoIB or not. The layering
problem is that upper layers (ULPs) should use it as a regular netdevice.

This is good to know. However, since the SMB protocol explicitly calls
for enumeration of all network interfaces on the server host,
including their RDMA capabilities, I believe this is a sensible
exception to the layering rule. Or is there any other way to do this
enumeration from kernel space?

Or we can give up implementing the full spec of the SMB protocol and
call for explicit configuration from user space on how to respond to
the IOCTL requests in question. Which one looks more sensible to you?

My preference is to have same IPoIB treatment for all ULPs, including SMB.

My GUESS is that SMB specification authors didn't take into account HW and
Linux SW development around IPoIB and weren't aware of IPoIB offload which
is implemented and enabled by default in all modern IB NICs and Linux OSes.

The SMB3 specification is completely unconcerned with IPoIB and any
other layer-2 or layer-3 implementation details. It merely discusses
an exchange of network interface capabilities such as link speed and
RDMA support. The SMB3 client uses this list to implement multichannel.

I totally agree that inspecting ARPHRD_INFINIBAND is an incorrect method
of building this list. Just because an interface supports IPoIB does not
mean it also exposes RDMA, especially in-kernel. And that ignores any
non-IB transport too of course.

Kangjing, please educate me if I'm confused here, but doesn't the
code in ksmbd_rdma_capable_netdev() look up the ib_device anyway, at
the end of the function?

	if (rdma_capable == false) {
		struct ib_device *ibdev;

		ibdev = ib_device_get_by_netdev(netdev, RDMA_DRIVER_UNKNOWN);
		if (ibdev) {
			if (rdma_frwr_is_supported(&ibdev->attrs))
				rdma_capable = true;
			ib_device_put(ibdev);
		}
	}

	return rdma_capable;

So, why is the code concerned at all with ARPHRD_INFINIBAND just a few
lines above? And why does it look in the smb_direct_device_list first?

Tom.

That offload allows line-rate for IPoIB, something that is not possible
for SW IPoIB.

Thanks

Thanks

Thanks

Thanks!

I am uncertain about how to proceed at this point and would like to
know your thoughts and opinions on this.

Thanks,
Kangjing

On Fri, Nov 8, 2024 at 5:59 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:

On Fri, Nov 08, 2024 at 08:40:40AM +0900, Namjae Jeon wrote:
On Thu, Nov 7, 2024 at 9:00 PM Halil Pasic <pasic@xxxxxxxxxxxxx> wrote:

On Wed, 6 Nov 2024 15:59:10 +0200
Leon Romanovsky <leon@xxxxxxxxxx> wrote:

Does fs/smb/server/transport_rdma.c qualify as inside of RDMA core code?

RDMA core code is drivers/infiniband/core/*.

Understood. So this is a violation of the no direct access to the
callbacks rule.

I would guess it is not, and I would not actually mind sending a patch
but I have trouble figuring out the logic behind commit ecce70cf17d9
("ksmbd: fix missing RDMA-capable flag for IPoIB device in
ksmbd_rdma_capable_netdev()").

It is a strange version of RDMA-CM. All other ULPs use RDMA-CM to avoid
GID, netdev and fabric complexity.

I'm not familiar enough with either of the subsystems. Based on your
answer my guess is that it ain't outright buggy but still a layering
violation. Copying linux-cifs@xxxxxxxxxxxxxxx so that
the SMB folks are aware.
Could you please elaborate on what the violation is?

There are many, but the most screaming is that ksmbd has logic to
differentiate IPoIB devices. These devices are pure netdev devices
and should be treated like that. ULPs should treat them exactly
as they treat netdev devices.

I would also appreciate it if you could suggest to me how to fix this.

Thanks.

Thank you very much for all the explanations!

Regards,
Halil

--
Kangjing "Chaser" Huang