[PATCH] IB2ROCE RDMA Gateway/Bridge: Initial release, multicast only

We talked about an implementation of a ROCE to IB gateway
a while back. See https://lore.kernel.org/all/YZDSMCSmixPdS8hf@unreal/T/

Attached is a patch that implements such a gateway. For now the implementation
supports only multicast, but we hope to have unicast support in
some form or another soon. There is also a GitHub project.
See https://github.com/clameter/rdma-core/tree/ib2roce

Comments and feedback welcome.
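
For reviewers who want to check the MGID layout without applying the
patch, the encoding done in new_mc_addr() can be sketched as a standalone
helper. This is only an illustration under stated assumptions:
build_mgid() and its parameters are hypothetical names that do not appear
in the patch, and the byte layout simply mirrors what the patch writes
(0xff15 prefix, signature in bytes 2-3, optional UDP port in bytes 10-11,
IPv4 address in bytes 12-15).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): fill a 16 byte MGID
 * from a host byte order IPv4 multicast address and UDP port.
 * All multi-byte fields are stored big endian, which is what the
 * patch achieves with htons()/htonl().
 */
static void build_mgid(uint8_t mgid[16], uint16_t signature,
		       uint32_t ipv4_host, uint16_t port_host,
		       int port_in_mgid, int full_ipv4)
{
	assert(mgid != NULL);
	memset(mgid, 0, 16);

	/* Bytes 0-1: the 0xff15 prefix the patch writes */
	mgid[0] = 0xff;
	mgid[1] = 0x15;

	/* Bytes 2-3: signature, e.g. 0x401B (IPv4) or 0xA01B (CLLM) */
	mgid[2] = (uint8_t)(signature >> 8);
	mgid[3] = (uint8_t)(signature & 0xff);

	/* Bytes 10-11: port, only for modes that carry it in the MGID */
	if (port_in_mgid) {
		mgid[10] = (uint8_t)(port_host >> 8);
		mgid[11] = (uint8_t)(port_host & 0xff);
	}

	/* Strip to 28 bits unless the mode keeps the full address */
	if (!full_ipv4)
		ipv4_host &= 0x0fffffff;

	/* Bytes 12-15: IPv4 multicast address */
	mgid[12] = (uint8_t)(ipv4_host >> 24);
	mgid[13] = (uint8_t)((ipv4_host >> 16) & 0xff);
	mgid[14] = (uint8_t)((ipv4_host >> 8) & 0xff);
	mgid[15] = (uint8_t)(ipv4_host & 0xff);
}
```

With the "IPv4" signature 0x401B, group 239.1.2.3 ends up as 0f:01:02:03
in the last four bytes. With the "CLLM" signature 0xA01B the full
ef:01:02:03 is kept and port 4711 lands in bytes 10-11 as 12:67.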


From: cl@xxxxxxxxx
Subject: [PATCH] Initial version of ib2roce.

Supports only multicast for now.

Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>

diff --git a/CMakeLists.txt b/CMakeLists.txt
index e9d1f463..3d0f39cf 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -701,6 +701,7 @@ endif()
 # Binaries
 if (NOT NL_KIND EQUAL 0)
   add_subdirectory(ibacm) # NO SPARSE
+  add_subdirectory(ib2roce)
 endif()

 if (NOT NL_KIND EQUAL 0)
diff --git a/debian/control b/debian/control
index d9fec3a5..6e387a2c 100644
--- a/debian/control
+++ b/debian/control
@@ -64,6 +64,16 @@ Description: InfiniBand Communication Manager Assistant (ACM)
  needed to establish a connection, but does not implement the CM protocol.
  A primary user of the ibacm service is the librdmacm library.

+Package: ib2roce
+Architecture: linux-any
+Depends: lsb-base (>= 3.2-14~),
+         rdma-core (>= 15),
+         ${misc:Depends},
+         ${shlibs:Depends}
+Description: Infiniband to ROCE bridge (ib2roce)
+ ib2roce implements a forwarder between Infiniband and ROCE with limited
+ functionality.
+
 Package: ibverbs-providers
 Architecture: linux-any
 Multi-Arch: same
diff --git a/debian/ib2roce.install b/debian/ib2roce.install
new file mode 100644
index 00000000..7330f082
--- /dev/null
+++ b/debian/ib2roce.install
@@ -0,0 +1,3 @@
+usr/bin/ib2roce
+usr/share/man/man1/ib2roce.1
+usr/share/man/man7/ib2roce.7
diff --git a/ib2roce/CMakeLists.txt b/ib2roce/CMakeLists.txt
new file mode 100644
index 00000000..faeb836c
--- /dev/null
+++ b/ib2roce/CMakeLists.txt
@@ -0,0 +1,19 @@
+include_directories("linux")
+include_directories(${NL_INCLUDE_DIRS})
+
+rdma_executable(ib2roce
+  ib2roce.c
+  )
+target_link_libraries(ib2roce LINK_PRIVATE
+  ibverbs
+  ibumad
+  rdmacm
+  ${NL_LIBRARIES}
+  ${SYSTEMD_LIBRARIES}
+  ${CMAKE_THREAD_LIBS_INIT}
+  ${CMAKE_DL_LIBS}
+  )
+rdma_man_pages(
+  man/ib2roce.1
+  man/ib2roce.7
+  )
diff --git a/ib2roce/ib2roce.c b/ib2roce/ib2roce.c
new file mode 100644
index 00000000..385ea11d
--- /dev/null
+++ b/ib2roce/ib2roce.c
@@ -0,0 +1,1732 @@
+/*
+ * RDMA Infiniband to ROCE Bridge or Gateway
+ *
+ * (C) 2021-2022 Christoph Lameter <cl@xxxxxxxxx>
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * $Author: Christoph Lameter [cl@xxxxxxxxx]$
+ *
+ */
+
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <signal.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <syslog.h>
+#include <fcntl.h>
+
+#include <arpa/inet.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <getopt.h>
+#include <sys/ioctl.h>
+#include <net/if.h>
+#include <rdma/rdma_cma.h>
+#include <infiniband/ib.h>
+#include <infiniband/verbs.h>
+#include <poll.h>
+
+#define VERSION "2022.0120"
+
+enum bool { false, true };
+#define MIN(a,b) (((a)<(b))?(a):(b))
+
+/* Globals */
+
+static unsigned default_port = 4711;		/* Port to use to bind to devices and for MC groups that do not have a port (if a port is required) */
+static enum bool debug = false;			/* Stay in foreground, print more details */
+static enum bool terminated = false;		/* Daemon received a signal to terminate */
+static enum bool update_requested = false;	/* Received SIGUSR1. Dump all MC data details */
+static enum bool beacon = false;		/* Announce our presence (and possibly coordinate between multiple instances in the future) */
+static enum bool bridging = true;		/* Allow bridging */
+
+/*
+ * Handling of special Multicast Group MGID encodings on Infiniband
+ */
+#define nr_mgid_signatures 4
+
+struct mgid_signature {		/* Manage different MGID formats used */
+	unsigned short signature;
+	const char *id;
+	enum bool port_in_mgid;
+	enum bool full_ipv4;
+} mgid_signatures[nr_mgid_signatures] = {
+	{	0x401B,	"IPv4",	false, false },
+	{	0x601B,	"IPv6",	false, false },
+	{	0xA01B,	"CLLM", true, true },
+	{	0x4001, "IB",	false, false }
+};
+
+struct mgid_signature *mgid_mode;
+
+/*
+ * Basic RDMA interface management
+ */
+
+#define MAX_GID 20
+#define MAX_INLINE_DATA 64
+
+static char *ib_name, *roce_name;
+
+enum interfaces { INFINIBAND, ROCE, NR_INTERFACES };
+
+static const char *interfaces_text[NR_INTERFACES] = { "Infiniband", "ROCE" };
+
+enum stats { packets_received, packets_sent, packets_bridged,
+		join_requests, join_failure, join_success,
+	        leave_requests,
+		nr_stats
+};
+
+static const char *stats_text[nr_stats] = {
+	"PacketsReceived", "PacketsSent", "PacketsBridged",
+	"JoinRequests", "JoinFailures", "JoinSuccess",
+	"LeaveRequests"
+};
+
+static int cq_high = 0;	/* Largest batch of CQs encountered */
+
+static struct i2r_interface {
+	struct rdma_cm_id *id;
+	struct rdma_event_channel *rdma_events;
+	struct ibv_context *context;
+	struct ibv_comp_channel *comp_events;
+	struct ibv_pd *pd;
+	struct ibv_cq *cq;
+	struct ibv_mr *mr;
+	unsigned int active_receive_buffers;
+	unsigned int nr_cq;
+	unsigned port;
+	unsigned mtu;
+	char if_name[IFNAMSIZ];
+	struct sockaddr_in if_addr;
+	struct sockaddr *bindaddr;
+	unsigned ifindex;
+	unsigned gid_index;
+	union ibv_gid gid;
+	unsigned long stats[nr_stats];
+	struct ibv_device_attr device_attr;
+	struct ibv_port_attr port_attr;
+	int iges;
+	struct ibv_gid_entry ige[MAX_GID];
+} i2r[NR_INTERFACES];
+
+static inline void st(struct i2r_interface *i, enum stats s)
+{
+	i->stats[s]++;
+}
+
+static inline struct rdma_cm_id *id(enum interfaces i)
+{
+	return i2r[i].id;
+}
+
+/* Check the RDMA device if it fits what was specified on the command line and store it if it matches */
+static int check_rdma_device(enum interfaces i, int port, char *name,
+	       struct ibv_context *c, struct ibv_port_attr *a, struct ibv_device_attr *d)
+{
+	char *s;
+	int p = 1;
+
+	if (i2r[i].context)
+		/* Already found a match */
+		return 0;
+
+	if (!name)
+		/* No command line option, take the first port/device */
+		goto success;
+
+	/* Port / device specified */
+	s = strchrnul(name,':');
+	if (*s)
+		/* Portnumber follows device name */
+		p = atoi(s + 1);
+
+	if (strncmp(name, ibv_get_device_name(c->device), s - name) ||
+			ibv_get_device_name(c->device)[s - name])
+		return 0;
+	if (port != p)
+		return 0;
+
+success:
+	if (a->active_mtu == IBV_MTU_4096)
+		i2r[i].mtu = 4096;
+	else if (a->active_mtu == IBV_MTU_2048)
+		i2r[i].mtu = 2048;
+	else
+		/* Other MTUs are not supported */
+		return 0;
+
+	i2r[i].context = c;
+	i2r[i].port = port;
+	i2r[i].port_attr = *a;
+	i2r[i].device_attr = *d;
+	return 1;
+}
+
+/* Scan through available RDMA devices in order to locate the devices for bridging */
+static int find_rdma_devices(void)
+{
+	int nr;
+	int i;
+	struct ibv_device **list;
+
+	list = ibv_get_device_list(&nr);
+
+	if (nr <= 0) {
+		syslog(LOG_CRIT, "No RDMA devices present.\n");
+		return 1;
+	}
+
+	for (i = 0; i < nr; i++) {
+		struct ibv_device *d = list[i];
+		struct ibv_context *c;
+		struct ibv_device_attr dattr;
+		int found = 0;
+		int port;
+
+		if (d->node_type != IBV_NODE_CA)
+			continue;
+
+		if (d->transport_type != IBV_TRANSPORT_IB)
+			continue;
+
+		c = ibv_open_device(d);
+		if (!c) {
+			syslog(LOG_CRIT, "Cannot open device %s\n", ibv_get_device_name(d));
+			return 1;
+		}
+
+		if (ibv_query_device(c, &dattr)) {
+			syslog(LOG_CRIT, "Cannot query device %s\n", ibv_get_device_name(d));
+			return 1;
+		}
+
+		for (port = 1; port <= dattr.phys_port_cnt; port++) {
+			struct ibv_port_attr attr;
+
+			if (ibv_query_port(c, port, &attr)) {
+				syslog(LOG_CRIT, "Cannot query port %s:%d\n", ibv_get_device_name(d), port);
+				return 1;
+			}
+
+			if (attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
+				if (check_rdma_device(INFINIBAND, port, ib_name, c, &attr, &dattr) &&
+					(!i2r[ROCE].mtu || i2r[ROCE].mtu == i2r[INFINIBAND].mtu))
+					found = 1;
+
+			} else if (attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
+				if (check_rdma_device(ROCE, port, roce_name, c, &attr, &dattr) &&
+					(!i2r[INFINIBAND].mtu || i2r[ROCE].mtu == i2r[INFINIBAND].mtu))
+					found = 1;
+			}
+		}
+
+		if (!found)
+			ibv_close_device(c);
+	}
+
+
+	ibv_free_device_list(list);
+
+
+	if (!i2r[ROCE].context) {
+
+		if (roce_name && roce_name[0] == '-')
+			bridging = false;
+		else {
+			if (roce_name)
+				syslog(LOG_CRIT, "ROCE device %s not found\n", roce_name);
+			else
+				syslog(LOG_CRIT,  "No ROCE device available.\n");
+
+			return 1;
+		}
+	}
+
+	if (!i2r[INFINIBAND].context) {
+
+		if (ib_name && ib_name[0] == '-' && i2r[ROCE].context)
+			bridging = false;
+		else {
+			if (ib_name)
+				syslog(LOG_CRIT, "Infiniband device %s not found.\n", ib_name);
+			else
+				syslog(LOG_CRIT, "No Infiniband device available.\n");
+
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/*
+ * Multicast Handling
+ */
+#define MAX_MC 500
+
+static unsigned nr_mc;
+static unsigned active_mc;	/* MC groups actively bridging */
+
+struct ah_info {
+	struct ibv_ah *ah;
+	unsigned remote_qpn;
+	unsigned remote_qkey;
+};
+
+enum mc_status { MC_OFF, MC_JOINING, MC_JOINED, MC_ERROR, NR_MC_STATUS };
+
+const char *mc_text[NR_MC_STATUS] = { "Inactive", "Joining", "Joined", "Error" };
+
+static struct mc {
+	struct in_addr addr;
+	enum mc_status status[2];
+	enum bool sendonly[2];
+	enum bool beacon;
+	struct ah_info ai[2];
+	struct mc *next;	/* For the hash */
+	struct sockaddr *sa[2];
+	struct mgid_signature *mgid_mode;
+	const char *text;
+} mcs[MAX_MC];
+
+/*
+ * Lookup of IP / Port via Hashes based on the IPv4 Multicast address.
+ * The hash uses only the low 24 bits of the address and ignores the
+ * highest 8 bits
+ */
+static struct mc *mc_hash[0x100];
+
+static unsigned sockaddr_hash(unsigned a)
+{
+	unsigned low = a & 0xff;
+	unsigned middle = (a & 0xff00) >> 8;
+	unsigned high = (a & 0xff0000) >> 16;
+
+	return (low + middle + high) & 0xff;
+}
+
+static struct mc *__hash_lookup_mc(struct in_addr addr, unsigned index)
+{
+	struct mc *p = mc_hash[index];
+
+	while (p && (addr.s_addr != p->addr.s_addr))
+		p = p->next;
+
+	return p;
+}
+
+static struct mc *hash_lookup_mc(struct in_addr addr)
+{
+	unsigned a = ntohl(addr.s_addr) | 0xe0000000; /* Infiniband may strip top 4 bits so provide them */
+	struct in_addr x = {
+		.s_addr = htonl(a)
+	};
+	unsigned index = sockaddr_hash(a);
+
+	return __hash_lookup_mc(x, index);
+}
+
+static int hash_add_mc(struct mc *m)
+{
+	unsigned a = ntohl(m->addr.s_addr);
+	unsigned index = sockaddr_hash(a);
+
+	if (__hash_lookup_mc(m->addr, index))
+		return -EEXIST;
+
+	m->next = mc_hash[index];
+	mc_hash[index] = m;
+	return 0;
+}
+
+static struct mgid_signature *find_mgid_mode(char *p)
+{
+	struct mgid_signature *g;
+
+	for(g = mgid_signatures; g < mgid_signatures + nr_mgid_signatures; g++)
+		if (strcasecmp(p, g->id) == 0)
+			break;
+
+	if (g >= mgid_signatures + nr_mgid_signatures) {
+		fprintf(stderr, "Not a valid mgid mode %s\n", p);
+		return NULL;
+	}
+	return g;
+}
+
+/* Multicast group specifications on the command line */
+static int new_mc_addr(char *arg,
+	       enum bool sendonly_infiniband,
+	      enum bool sendonly_roce)
+{
+	struct addrinfo *res;
+	char *service;
+	const struct addrinfo hints = {
+		.ai_family = AF_INET,
+		.ai_socktype = SOCK_DGRAM,
+		.ai_protocol = IPPROTO_UDP
+	};
+	struct sockaddr_in *si;
+	struct mc *m = mcs + nr_mc;
+	char *p;
+	int ret;
+
+	if (nr_mc == MAX_MC) {
+		fprintf(stderr, "Too many multicast groups\n");
+		return 1;
+	}
+
+	m->sendonly[INFINIBAND] = sendonly_infiniband;
+	m->sendonly[ROCE] = sendonly_roce;
+	m->text = strdup(arg);
+
+	service = strchr(arg, ':');
+
+	if (service) {
+
+		*service++ = 0;
+		p = service;
+
+	} else {
+		char *s = alloca(10);
+
+		snprintf(s, 10, "%d", default_port);
+		service = s;
+		p = arg;
+	}
+
+	p = strchr(p, '/');
+	if (p) {
+		*p++ = 0;
+		m->mgid_mode = find_mgid_mode(p);
+
+		if (!m->mgid_mode)
+			return -EINVAL;
+	} else
+		m->mgid_mode = mgid_mode;
+
+	ret = getaddrinfo(arg, service, &hints, &res);
+	if (ret) {
+		fprintf(stderr, "getaddrinfo() failed (%s) - invalid IP address.\n", gai_strerror(ret));
+		return ret;
+	}
+
+	ret = 1;
+
+	si = (struct sockaddr_in *)res->ai_addr;
+
+	m->addr = si->sin_addr;
+	if (!IN_MULTICAST(ntohl(m->addr.s_addr))) {
+		fprintf(stderr, "Not a multicast address (%s)\n", arg);
+		goto out;
+	}
+
+	ret = hash_add_mc(m);
+	if (ret) {
+		fprintf(stderr, "Duplicate multicast address (%s)\n", arg);
+		goto out;
+	}
+
+	m->sa[ROCE] = malloc(sizeof(struct sockaddr_in));
+	memcpy(m->sa[ROCE], si, sizeof(struct sockaddr_in));
+	m->sa[INFINIBAND] = m->sa[ROCE];
+
+	if (m->mgid_mode) {
+		/*
+		 * MGID is built according to RFC 4391 Section 4
+		 * by taking 28 bits and putting them into the mgid
+		 *
+		 * But then CLLM and others include the full 32 bits...
+		 * Deal with this crappy situation.
+		 */
+		struct sockaddr_ib *saib	= calloc(1, sizeof(struct sockaddr_ib));
+		unsigned short *mgid_header	= (unsigned short *)saib->sib_addr.sib_raw;
+		unsigned short *mgid_signature	= (unsigned short *)(saib->sib_addr.sib_raw + 2);
+		unsigned short *mgid_port	= (unsigned short *)(saib->sib_addr.sib_raw + 10);
+		unsigned int *mgid_ipv4	= (unsigned int *)(saib->sib_addr.sib_raw + 12);
+		unsigned int  multicast = ntohl(m->addr.s_addr);
+
+		saib->sib_family = AF_IB;
+		saib->sib_sid = si->sin_port;
+
+		*mgid_header = htons(0xff15);
+
+		*mgid_signature = htons(m->mgid_mode->signature);
+
+		if (m->mgid_mode->port_in_mgid)
+			*mgid_port = si->sin_port;
+
+		if (!m->mgid_mode->full_ipv4)
+			/* Strip to 28 bits according to RFC */
+			multicast &= 0x0fffffff;
+
+		*mgid_ipv4 = htonl(multicast);
+
+		m->sa[INFINIBAND] = (struct sockaddr *)saib;
+	}
+
+
+	nr_mc++;
+	ret = 0;
+
+out:
+
+	freeaddrinfo(res);
+	return ret;
+}
+
+static int _join_mc(struct in_addr addr, struct sockaddr *sa,
+				unsigned port,enum interfaces i,
+				enum bool sendonly, void *private)
+{
+	struct rdma_cm_join_mc_attr_ex mc_attr = {
+		.comp_mask = RDMA_CM_JOIN_MC_ATTR_ADDRESS | RDMA_CM_JOIN_MC_ATTR_JOIN_FLAGS,
+		.join_flags = sendonly ? RDMA_MC_JOIN_FLAG_SENDONLY_FULLMEMBER
+                                       : RDMA_MC_JOIN_FLAG_FULLMEMBER,
+		.addr = sa
+	};
+	int ret;
+
+	ret = rdma_join_multicast_ex(id(i), &mc_attr, private);
+
+	if (ret) {
+		syslog(LOG_ERR, "Failed to create join request %s:%d on %s. Error %d\n",
+			inet_ntoa(addr), port,
+			interfaces_text[i],
+			errno);
+		return 1;
+	}
+	syslog(LOG_NOTICE, "Join Request %sMC group %s:%d on %s.\n",
+		sendonly ? "Sendonly " : "",
+		inet_ntoa(addr), port,
+		interfaces_text[i]);
+	st(i2r + i, join_requests);
+	return 0;
+}
+
+static int _leave_mc(struct in_addr addr,struct sockaddr *si, enum interfaces i)
+{
+	int ret;
+
+	ret = rdma_leave_multicast(id(i), si);
+	if (ret) {
+		perror("Failure to leave");
+		return 1;
+	}
+	syslog(LOG_NOTICE, "Leaving MC group %s on %s.\n",
+		inet_ntoa(addr),
+		interfaces_text[i]);
+	st(i2r + i, leave_requests);
+	return 0;
+}
+
+static int leave_mc(enum interfaces i)
+{
+	int j;
+	int ret;
+
+	for (j = 0; j < nr_mc; j++) {
+		struct mc *m = mcs + j;
+
+		ret = _leave_mc(m->addr, m->sa[i], i);
+		if (ret)
+			return 1;
+	}
+	return 0;
+}
+
+/*
+ * Manage freelist using simple single linked list with the pointer
+ * to the next free element at the beginning of the free buffer
+ */
+
+// static const unsigned buf_size = 4096;
+// static const unsigned nr_buffers = 20000;
+
+#define buf_size 4096
+#define nr_buffers 20000
+
+struct buf {
+	struct buf *next;	/* Next buffer */
+	enum bool free;
+	/* Add more metadata here */
+	struct ibv_grh grh;	/* GRH header as included in UD connections */
+	uint8_t payload[buf_size];	/* I wish this was page aligned at some point */
+};
+
+static void beacon_received(struct buf *buf);
+
+static struct buf buffers[nr_buffers];
+
+static struct buf *nextbuffer;	/* Pointer to next available RDMA buffer */
+
+static void free_buffer(struct buf *buf)
+{
+	buf->free = true;
+	buf->next = nextbuffer;
+	nextbuffer = buf;
+}
+
+static void init_buf(void)
+{
+	int i;
+
+	/*
+	 * Free in reverse so that we have a linked list
+	 * starting at the first element which points to
+	 * the second and so on.
+	 */
+	for (i = nr_buffers; i > 0; i--)
+		free_buffer(&buffers[i-1]);
+}
+
+static struct buf *alloc_buffer(void)
+{
+	struct buf *buf = nextbuffer;
+
+	if (buf) {
+		nextbuffer = buf->next;
+		buf->free = false;
+	}
+
+	return buf;
+}
+
+
+/*
+ * Handling of RDMA work requests
+ */
+
+static int post_receive_buffers(struct i2r_interface *i)
+{
+	struct ibv_recv_wr recv_wr, *recv_failure;
+	struct ibv_sge sge;
+	int ret = 0;
+
+	if (!nextbuffer || i->active_receive_buffers >= i->nr_cq / 2 )
+		return -EAGAIN;
+
+	recv_wr.next = NULL;
+	recv_wr.sg_list = &sge;
+	recv_wr.num_sge = 1;
+
+	sge.length = i->mtu + sizeof(struct ibv_grh);
+	sge.lkey = i->mr->lkey;
+
+	while (i->active_receive_buffers < i->nr_cq / 2) {
+
+		struct buf *buf = alloc_buffer();
+
+
+		if (!buf) {
+			syslog(LOG_NOTICE, "No free buffers left\n");
+			ret = -ENOMEM;
+			break;
+		}
+
+		/* Use the buffer address for the completion handler */
+		recv_wr.wr_id = (uint64_t)buf;
+		sge.addr = (uint64_t)&buf->grh;
+		ret = ibv_post_recv(i->id->qp, &recv_wr, &recv_failure);
+		if (ret) {
+			free_buffer(buf);
+			syslog(LOG_WARNING, "ibv_post_recv failed: %d\n", ret);
+			break;
+                }
+		i->active_receive_buffers++;
+	}
+	return ret;
+}
+
+
+static int qp_destroy(struct i2r_interface *i)
+{
+	if (i->id->qp)
+		rdma_destroy_qp(i->id);
+
+	if (i->cq)
+		ibv_destroy_cq(i->cq);
+	if (i->mr)
+		ibv_dereg_mr(i->mr);
+
+	if (i->pd)
+		ibv_dealloc_pd(i->pd);
+
+	return 0;
+}
+
+/* Retrieve Kernel Stack info about the interface */
+static void get_if_info(struct i2r_interface *i, int ifindex)
+{
+	int fh = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP);
+	struct ifreq ifr;
+
+	if (fh < 0)
+		goto err;
+
+	if (!ifindex) {
+		/* RDMA subsystem did not give us Interface info */
+		if (i->if_name[0]) {
+			memcpy(ifr.ifr_name, i->if_name, IFNAMSIZ);
+			goto have_name;
+		}
+		goto err;
+	}
+
+	ifr.ifr_ifindex = ifindex;
+
+	if (ioctl(fh, SIOCGIFNAME, &ifr) < 0)
+		goto err;
+
+	memcpy(i->if_name, ifr.ifr_name, IFNAMSIZ);
+
+have_name:
+	if (ioctl(fh, SIOCGIFADDR, &ifr) < 0)
+		goto err2;
+
+	memcpy(&i->if_addr, &ifr.ifr_addr, sizeof(struct sockaddr_in));
+	goto out;
+
+err:
+	strcpy(i->if_name, "-");
+err2:
+	memset(&i->if_addr, 0, sizeof(i->if_addr));
+out:
+	close(fh);
+}
+
+static void setup_interface(enum interfaces in)
+{
+	struct i2r_interface *i = i2r + in;
+	struct ibv_gid_entry *e;
+	struct ibv_qp_init_attr_ex init_qp_attr_ex;
+	char buf[INET6_ADDRSTRLEN];
+	struct sockaddr_in *sin;
+	int ret;
+
+	if (!i->context)
+		return;
+
+	i->rdma_events = rdma_create_event_channel();
+	if (!i->rdma_events) {
+		syslog(LOG_CRIT, "rdma_create_event_channel() for %s failed (%d).\n",
+			interfaces_text[in], errno);
+		abort();
+	}
+
+	ret = rdma_create_id(i->rdma_events, &i->id, i, RDMA_PS_UDP);
+	if (ret) {
+		syslog(LOG_CRIT, "Failed to allocate RDMA CM ID for %s (%d).\n",
+			interfaces_text[in], errno);
+		abort();
+	}
+
+	/* Determine the GID */
+	i->iges = ibv_query_gid_table(i->context, i->ige, MAX_GID, 0);
+
+	if (i->iges <= 0) {
+		syslog(LOG_CRIT, "Failed to obtain GID table for %s\n",
+			interfaces_text[in]);
+		abort();
+	}
+
+	/* Find the correct gid entry */
+	for (e = i->ige; e < i->ige + i->iges; e++) {
+
+		if (e->port_num != i->port)
+			continue;
+
+		if (e->gid_type == IBV_GID_TYPE_IB && in == INFINIBAND)
+			break;
+
+		if (e->gid_type == IBV_GID_TYPE_ROCE_V2 && in == ROCE)
+			break;
+	}
+
+	if (e >= i->ige + i->iges) {
+		syslog(LOG_CRIT, "Failed to find GIDs in GID table for %s\n",
+			interfaces_text[in]);
+		abort();
+	}
+
+	/* Copy our connection info from GID table */
+	i->gid = e->gid;
+	i->gid_index = e->gid_index;
+	i->ifindex = e->ndev_ifindex;
+
+	/*
+	 * Work around the quirk of ifindex always being zero for
+	 * INFINIBAND interfaces. Just assume it's ib0.
+	 */
+	if (!i->ifindex && in == INFINIBAND)
+		strcpy(i->if_name,"ib0");
+
+	get_if_info(i, i->ifindex);
+
+	sin = calloc(1, sizeof(struct sockaddr_in));
+	*sin = i->if_addr;
+	sin->sin_family = AF_INET;
+	sin->sin_port = htons(default_port);
+	i->bindaddr = (struct sockaddr *)sin;
+
+	ret = rdma_bind_addr(i->id, i->bindaddr);
+	if (ret) {
+		syslog(LOG_CRIT, "Failed to bind %s interface. Error %d\n",
+			interfaces_text[in], errno);
+		abort();
+	}
+
+	i->pd = ibv_alloc_pd(i->id->verbs);
+	if (!i->pd) {
+		syslog(LOG_CRIT, "ibv_alloc_pd failed for %s.\n",
+			interfaces_text[in]);
+		abort();
+	}
+
+	i->comp_events = ibv_create_comp_channel(i->context);
+	if (!i->comp_events) {
+		syslog(LOG_CRIT, "ibv_create_comp_channel failed for %s.\n",
+			interfaces_text[in]);
+		abort();
+	}
+
+	i->nr_cq = MIN(i->device_attr.max_cqe, nr_buffers / 2);
+	i->cq = ibv_create_cq(i->id->verbs, i->nr_cq, i, i->comp_events, 0);
+	if (!i->cq) {
+		syslog(LOG_CRIT, "ibv_create_cq failed for %s.\n",
+			interfaces_text[in]);
+		abort();
+	}
+
+	memset(&init_qp_attr_ex, 0, sizeof(init_qp_attr_ex));
+	init_qp_attr_ex.cap.max_send_wr = i->nr_cq;
+	init_qp_attr_ex.cap.max_recv_wr = i->nr_cq;
+	init_qp_attr_ex.cap.max_send_sge = 1;
+	init_qp_attr_ex.cap.max_recv_sge = 1;
+	init_qp_attr_ex.cap.max_inline_data = MAX_INLINE_DATA;
+	init_qp_attr_ex.qp_context = i;
+	init_qp_attr_ex.sq_sig_all = 0;
+	init_qp_attr_ex.qp_type = IBV_QPT_UD;
+	init_qp_attr_ex.send_cq = i->cq;
+	init_qp_attr_ex.recv_cq = i->cq;
+
+	init_qp_attr_ex.comp_mask = IBV_QP_INIT_ATTR_CREATE_FLAGS|IBV_QP_INIT_ATTR_PD;
+	init_qp_attr_ex.pd = i->pd;
+	init_qp_attr_ex.create_flags = IBV_QP_CREATE_BLOCK_SELF_MCAST_LB;
+
+	ret = rdma_create_qp_ex(i->id, &init_qp_attr_ex);
+	if (ret) {
+		syslog(LOG_CRIT, "rdma_create_qp_ex failed for %s. Error %d.\n",
+			interfaces_text[in], errno);
+		abort();
+	}
+
+	i->mr = ibv_reg_mr(i->pd, buffers, nr_buffers * sizeof(struct buf), IBV_ACCESS_LOCAL_WRITE);
+	if (!i->mr) {
+		syslog(LOG_CRIT, "ibv_reg_mr failed for %s.\n",
+			interfaces_text[in]);
+		abort();
+	}
+
+	syslog(LOG_NOTICE, "%s interface %s/%s port %d GID=%s/%d IPv4=%s CQs=%u MTU=%u ready.\n",
+		interfaces_text[in],
+		ibv_get_device_name(i->context->device),i->if_name,
+		i->port,
+		inet_ntop(AF_INET6, e->gid.raw, buf, INET6_ADDRSTRLEN),i->gid_index,
+		inet_ntoa(i->if_addr.sin_addr),
+		i->nr_cq,
+		i->mtu
+	);
+}
+
+static void shutdown_ib(void)
+{
+	if (!i2r[INFINIBAND].context)
+		return;
+
+	leave_mc(INFINIBAND);
+
+	/* Shutdown Interface */
+	qp_destroy(i2r + INFINIBAND);
+	rdma_destroy_id(id(INFINIBAND));
+}
+
+static void shutdown_roce(void)
+{
+	if (!i2r[ROCE].context)
+		return;
+
+	leave_mc(ROCE);
+
+	/* Shutdown Interface */
+	qp_destroy(i2r + ROCE);
+	rdma_destroy_id(id(ROCE));
+}
+
+/*
+ * Join MC groups. This is called from the event loop every second
+ * as long as there are unjoined groups
+ */
+static void join_processing(void)
+{
+	int i;
+	enum interfaces in;
+	int mcs_per_call = 0;
+
+	for (i = 0; i < nr_mc; i++) {
+		struct mc *m = mcs + i;
+		unsigned port = ntohs(((struct sockaddr_in *)(m->sa[ROCE]))->sin_port);
+
+		if (m->status[ROCE] == MC_JOINED && m->status[INFINIBAND] == MC_JOINED)
+			continue;
+
+		for(in = 0; in < 2; in++)
+			switch(m->status[in]) {
+
+			case MC_OFF:
+				if (_join_mc(m->addr, m->sa[in], port, in, m->sendonly[in], m) == 0)
+					m->status[in] = MC_JOINING;
+				break;
+
+			case MC_ERROR:
+
+				_leave_mc(m->addr, m->sa[in], in);
+				m->status[in] = MC_OFF;
+				syslog(LOG_WARNING, "Left Multicast group %s on %s due to MC_ERROR\n",
+						m->text, interfaces_text[in]);
+				break;
+			case MC_JOINING:	/* Waiting for the join event */
+			case MC_JOINED:
+				break;
+
+			default:
+				syslog(LOG_ERR, "Bad MC status %d MC %s on %s\n",
+					       m->status[in], m->text, interfaces_text[in]);
+				break;
+		}
+
+		mcs_per_call++;
+
+		if (mcs_per_call > 10)
+			break;
+
+	}
+}
+
+static void handle_rdma_event(enum interfaces in)
+{
+	struct rdma_cm_event *event;
+	int ret;
+	struct i2r_interface *i = i2r + in;
+
+	ret = rdma_get_cm_event(i->rdma_events, &event);
+	if (ret) {
+		syslog(LOG_WARNING, "rdma_get_cm_event() failed. Error = %d\n", errno);
+		return;
+	}
+
+	switch(event->event) {
+		/* Connection events */
+		case RDMA_CM_EVENT_MULTICAST_JOIN:
+			{
+				struct rdma_ud_param *param = &event->param.ud;
+				struct mc *m = (struct mc *)param->private_data;
+				struct ah_info *a = m->ai + in;
+				char buf[40];
+
+				a->remote_qpn = param->qp_num;
+				a->remote_qkey = param->qkey;
+				a->ah = ibv_create_ah(i->pd, &param->ah_attr);
+				if (!a->ah) {
+					syslog(LOG_ERR, "Failed to create AH for Multicast group %s on %s \n",
+						m->text, interfaces_text[in]);
+					m->status[in] = MC_ERROR;
+					break;
+				}
+				m->status[in] = MC_JOINED;
+
+				/* Things actually work if both multicast groups are joined */
+				if (!bridging || m->status[in ^ 1] == MC_JOINED)
+					active_mc++;
+
+				syslog(LOG_NOTICE, "Joined %s MLID 0x%x sl %u on %s\n",
+					inet_ntop(AF_INET6, param->ah_attr.grh.dgid.raw, buf, 40),
+					param->ah_attr.dlid,
+					param->ah_attr.sl,
+					interfaces_text[in]);
+				st(i, join_success);
+			}
+			break;
+
+		case RDMA_CM_EVENT_MULTICAST_ERROR:
+			{
+				struct rdma_ud_param *param = &event->param.ud;
+				struct mc *m = (struct mc *)param->private_data;
+
+				syslog(LOG_ERR, "Multicast Error. Group %s on %s\n",
+					m->text, interfaces_text[in]);
+
+				/* If already joined then the bridging may no longer work */
+				if (!bridging || (m->status[in] == MC_JOINED && m->status[in ^ 1] == MC_JOINED))
+				       active_mc--;
+
+				m->status[in] = MC_ERROR;
+				st(i, join_failure);
+			}
+			break;
+
+		case RDMA_CM_EVENT_ADDR_RESOLVED:
+			syslog(LOG_ERR, "Unexpected event ADDR_RESOLVED\n");
+			break;
+
+		/* Disconnection events */
+		case RDMA_CM_EVENT_ADDR_ERROR:
+		case RDMA_CM_EVENT_ROUTE_ERROR:
+		case RDMA_CM_EVENT_ADDR_CHANGE:
+			syslog(LOG_ERR, "RDMA Event handler:%s status: %d\n",
+				rdma_event_str(event->event), event->status);
+			break;
+		default:
+			syslog(LOG_NOTICE, "RDMA Event handler:%s status: %d\n",
+				rdma_event_str(event->event), event->status);
+			break;
+	}
+
+	rdma_ack_cm_event(event);
+}
+
+/*
+ * Do not use a buffer but simply include data directly into WR.
+ * Advantage: No buffer used and therefore faster since no memory
+ * fetch has to be done by the RDMA subsystem and no completion
+ * event has to be handled.
+ *
+ * Space in the WR is limited, so it only works for very small packets.
+ */
+static int send_inline(struct i2r_interface *i, void *buf, unsigned len, struct ah_info *ai)
+{
+	struct ibv_sge sge = {
+		.length = len,
+		.addr = (uint64_t)buf
+	};
+	struct ibv_send_wr wr = {
+		.sg_list = &sge,
+		.num_sge = 1,
+		.opcode = IBV_WR_SEND_WITH_IMM,
+		.send_flags = IBV_SEND_INLINE,
+		.imm_data = htobe32(i->id->qp->qp_num),
+		.wr = {
+			/* Get addr info  */
+			.ud = {
+				.ah = ai->ah,
+				.remote_qpn = ai->remote_qpn,
+				.remote_qkey = ai->remote_qkey
+			}
+		}
+
+	};
+	struct ibv_send_wr *bad_send_wr;
+	int ret;
+
+	if (len > MAX_INLINE_DATA)
+		return -E2BIG;
+
+	ret = ibv_post_send(i->id->qp, &wr, &bad_send_wr);
+	if (ret)
+		syslog(LOG_WARNING, "Failed to post inline send: %d\n", ret);
+
+	return ret;
+}
+
+static int send_buf(struct i2r_interface *i, struct buf *buf, unsigned len, struct ah_info *ai)
+{
+	struct ibv_send_wr wr, *bad_send_wr;
+	struct ibv_sge sge;
+	int ret;
+
+	memset(&wr, 0, sizeof(wr));
+	wr.sg_list = &sge;
+	wr.num_sge = 1;
+	wr.opcode = IBV_WR_SEND_WITH_IMM;
+	wr.send_flags = IBV_SEND_SIGNALED;
+	wr.wr_id = (uint64_t)buf;
+	wr.imm_data = htobe32(i->id->qp->qp_num);
+
+	/* Get addr info  */
+	wr.wr.ud.ah = ai->ah;
+	wr.wr.ud.remote_qpn = ai->remote_qpn;
+	wr.wr.ud.remote_qkey = ai->remote_qkey;
+
+	sge.length = len;
+	sge.lkey = i->mr->lkey;
+	sge.addr = (uint64_t)buf->payload;
+
+	ret = ibv_post_send(i->id->qp, &wr, &bad_send_wr);
+	if (ret)
+		syslog(LOG_WARNING, "Failed to post send: %d\n", ret);
+
+	return ret;
+}
+
+static int recv_buf(struct i2r_interface *i,
+			struct buf *buf, struct ibv_wc *w)
+{
+	struct mc *m;
+	enum interfaces in = i - i2r;
+	unsigned len;
+	struct in_addr source_addr = *(struct in_addr *)(buf->grh.sgid.raw + 12);
+	struct in_addr dest_addr = *(struct in_addr *)(buf->grh.dgid.raw + 12);
+	unsigned port = 0;
+	char xbuf[INET6_ADDRSTRLEN];
+
+	if (!(w->wc_flags & IBV_WC_GRH)) {
+		syslog(LOG_WARNING, "Discard Packet: No GRH provided %s/%s\n",
+			inet_ntoa(dest_addr), interfaces_text[in]);
+		return -EINVAL;
+	}
+
+	m = hash_lookup_mc(dest_addr);
+
+	if (!m) {
+		syslog(LOG_WARNING, "Discard Packet: Multicast group %s not found\n",
+			inet_ntoa(dest_addr));
+		return -ENODATA;
+	}
+
+	if (m->sendonly[in]) {
+
+		syslog(LOG_WARNING, "Discard Packet: Received data from Sendonly MC group %s from %s\n",
+			m->text, interfaces_text[in]);
+		return -EPERM;
+	}
+
+	if (in == INFINIBAND) {
+		unsigned char *mgid = buf->grh.dgid.raw;
+		unsigned short signature = ntohs(*(unsigned short*)(mgid + 2));
+
+		if (mgid[0] != 0xff) {
+			syslog(LOG_WARNING, "Discard Packet: Not multicast. MGID=%s/%s\n",
+				inet_ntop(AF_INET6, mgid, xbuf, INET6_ADDRSTRLEN), interfaces_text[in]);
+			return -EINVAL;
+		}
+
+		if (memcmp(&buf->grh.sgid, &i->gid, sizeof(union ibv_gid)) == 0) {
+			syslog(LOG_WARNING, "Discard Packet: Loopback from this host. MGID=%s/%s\n",
+				inet_ntop(AF_INET6, mgid, xbuf, INET6_ADDRSTRLEN), interfaces_text[in]);
+			return -EINVAL;
+		}
+
+		if (m->mgid_mode) {
+			if (signature == m->mgid_mode->signature) {
+				if (m->mgid_mode->port_in_mgid)
+					port = ntohs(*((unsigned short *)(mgid + 10)));
+			} else {
+				syslog(LOG_WARNING, "Discard Packet: MGID multicast signature(%x)  mismatch. MGID=%s\n",
+						signature,
+						inet_ntop(AF_INET6, mgid, xbuf, INET6_ADDRSTRLEN));
+				return -EINVAL;
+			}
+		}
+
+	} else { /* ROCE */
+		struct in_addr local_addr = ((struct sockaddr_in *)i->bindaddr)->sin_addr;
+
+		if (source_addr.s_addr == local_addr.s_addr) {
+			syslog(LOG_WARNING, "Discard Packet: Loopback from this host. %s/%s\n",
+				inet_ntoa(source_addr), interfaces_text[in]);
+			return -EINVAL;
+		}
+	}
+
+	if (m->beacon)	{
+		beacon_received(buf);
+		return 1;
+	}
+
+	if (!bridging)
+		return -ENOSYS;
+
+	len = w->byte_len - sizeof(struct ibv_grh);
+	return send_buf(i2r + (in ^ 1), buf, len, m->ai + (in ^ 1));
+}
+
+static void handle_comp_event(enum interfaces in)
+{
+	struct i2r_interface *i = i2r + in;
+	struct ibv_cq *cq;
+	void *private;
+	int cqs;
+	struct ibv_wc wc[100];
+	int j;
+
+	ibv_get_cq_event(i->comp_events, &cq, &private);
+	if (cq != i->cq) {
+		syslog(LOG_CRIT, "ibv_get_cq_event: CQ mismatch\n");
+		abort();
+	}
+
+	ibv_ack_cq_events(cq, 1);
+	if (ibv_req_notify_cq(cq, 0)) {
+		syslog(LOG_CRIT, "ibv_req_notify_cq: Failed\n");
+		abort();
+	}
+
+redo:
+	/* Retrieve completion events and process incoming data */
+	cqs = ibv_poll_cq(cq, 100, wc);
+	if (cqs < 0) {
+		syslog(LOG_WARNING, "CQ polling failed with: %d on %s\n",
+			cqs, interfaces_text[i - i2r]);
+		goto exit;
+	}
+
+	if (cqs == 0)
+		goto exit;
+
+	if (cqs > cq_high)
+		cq_high = cqs;
+
+	for (j = 0; j < cqs; j++) {
+		struct ibv_wc *w = wc + j;
+		struct buf *buf = (struct buf *)w->wr_id;
+
+		if (w->status == IBV_WC_SUCCESS && w->opcode == IBV_WC_RECV) {
+
+			i->active_receive_buffers--;
+			st(i, packets_received);
+
+			if (recv_buf(i, buf, w))
+				free_buffer(buf);
+			else
+				st(i, packets_bridged);
+
+		} else {
+			if (w->status == IBV_WC_SUCCESS && w->opcode == IBV_WC_SEND)
+				st(i, packets_sent);
+			else
+				syslog(LOG_NOTICE, "Strange CQ Entry %d/%d: Status:%x Opcode:%x Len:%u QP=%u SRC_QP=%u Flags=%x\n",
+					j, cqs, w->status, w->opcode, w->byte_len, w->qp_num, w->src_qp, w->wc_flags);
+
+			free_buffer(buf);
+		}
+	}
+	goto redo;
+
+exit:
+
+	/* Since we freed some buffers up we may be able to post more of them */
+	post_receive_buffers(i);
+}
+
+static void handle_async_event(enum interfaces in)
+{
+	struct i2r_interface *i = i2r + in;
+	struct ibv_async_event event;
+
+	if (ibv_get_async_event(i2r[in].context, &event)) {
+		syslog(LOG_ALERT, "Async event retrieval failed.\n");
+		terminated = true;
+		return;
+	}
+
+	syslog(LOG_ALERT, "Async RDMA EVENT %d\n", event.event_type);
+
+	/*
+	 * Regardless of the cause, the first approach here is to
+	 * simply terminate the program. We can make exceptions later.
+	 */
+	terminated = true;
+	ibv_ack_async_event(&event);
+}
+
+static int status_fd;
+
+static void status_write(void)
+{
+	static char b[10000];
+	int i,j;
+	int n = 0;
+	int free = 0;
+	struct buf *buf;
+	int fd = status_fd;
+	struct mc *m;
+
+	if (update_requested) {
+
+		char name[40];
+		time_t t = time(NULL);
+		struct tm tm;
+
+		localtime_r(&t, &tm);
+
+		snprintf(name, 40, "ib2roce-%d%02d%02dT%02d%02d%02d",
+				tm.tm_year + 1900, tm.tm_mon +1, tm.tm_mday,
+				tm.tm_hour, tm.tm_min, tm.tm_sec);
+		fd = open(name, O_CREAT | O_RDWR,  S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
+
+	} else
+		lseek(fd, 0, SEEK_SET);
+
+	for(buf = buffers; buf < buffers + nr_buffers; buf++)
+		if (buf->free)
+		       free++;
+
+	n+= sprintf(b + n, "Multicast: Active=%u NR=%u Max=%u\nBuffers: Active=%u Total=%u CQ#High=%u\n\n",
+		active_mc, nr_mc, MAX_MC, nr_buffers-free , nr_buffers, cq_high);
+
+	for(m = mcs; m < mcs + nr_mc; m++)
+
+		n += sprintf(n + b, "%s INFINIBAND: %s%s%s ROCE: %s%s\n",
+			inet_ntoa(m->addr),
+			mc_text[m->status[INFINIBAND]],
+			m->sendonly[INFINIBAND] ? "Sendonly " : "",
+			m->mgid_mode ? m->mgid_mode->id : "",
+			mc_text[m->status[ROCE]],
+			m->sendonly[ROCE] ? "Sendonly" : "");
+
+	for(i = 0; i < 2; i++) {
+		n += sprintf(b + n, "\nPacket Statistics for %s:\n", interfaces_text[i]);
+
+		for(j =0; j < nr_stats; j++) {
+			n += sprintf(b + n, "%s=%lu\n", stats_text[j], i2r[i].stats[j]);
+		}
+	}
+
+	n += sprintf(n + b, "\n\n\n\n\n\n\n\n");
+	write(fd, b, n);
+
+	if (update_requested) {
+		close(fd);
+		update_requested = false;
+	}
+}
+
+/*
+ * Beacon processing
+ */
+struct beacon_info {
+	char version[10];
+	struct in_addr infiniband;
+	struct in_addr roce;
+	unsigned nr_mc;
+};
+
+struct mc *beacon_mc;
+
+static void beacon_received(struct buf *buf)
+{
+	struct beacon_info *b = (struct beacon_info *)buf->payload;
+	char ib[40];
+
+	strcpy(ib, inet_ntoa(b->infiniband));
+	syslog(LOG_NOTICE, "Received Beacon on %s Version %s IB=%s, ROCE=%s MC groups=%u\n",
+		beacon_mc->text, b->version, ib, inet_ntoa(b->roce), b->nr_mc);
+}
+
+static void beacon_send(void)
+{
+	struct beacon_info b = { 0 };
+	struct buf *buf;
+	int i;
+
+	snprintf(b.version, sizeof(b.version), "%s", VERSION);
+	b.infiniband = i2r[INFINIBAND].if_addr.sin_addr;
+	b.roce = i2r[ROCE].if_addr.sin_addr;
+	b.nr_mc = nr_mc;
+
+	for(i = 0; i < NR_INTERFACES; i++)
+	if (i2r[i].context && beacon_mc->status[i] == MC_JOINED) {
+		if (sizeof(b) > MAX_INLINE_DATA) {
+			buf = alloc_buffer();
+			memcpy(buf->payload, &b, sizeof(b));
+			send_buf(i2r + i, buf, sizeof(b), beacon_mc->ai + i);
+		} else
+			send_inline(i2r + i, &b, sizeof(b), beacon_mc->ai + i);
+	}
+}
+
+static void beacon_setup(void)
+{
+	struct mc *m = mcs + nr_mc++;
+	struct sockaddr_in *sin;
+
+	sin = calloc(1, sizeof(struct sockaddr_in));
+
+	sin->sin_family = AF_INET;
+	sin->sin_port = htons(999);
+	sin->sin_addr.s_addr = inet_addr("239.1.2.3");
+
+	memset(m, 0, sizeof(*m));
+	m->beacon = true;
+	m->text = "239.1.2.3:999";
+	m->addr = sin->sin_addr;
+	m->sa[INFINIBAND] = m->sa[ROCE] = (struct sockaddr *)sin;
+	if (hash_add_mc(m)) {
+		syslog(LOG_ERR, "Beacon MC already in use.\n");
+		beacon = false;
+		free(sin);
+	}
+	beacon_mc = m;
+}
+
+
+static int event_loop(void)
+{
+	unsigned timeout = 1000;
+	struct pollfd pfd[6] = {
+		{ i2r[INFINIBAND].rdma_events->fd, POLLIN|POLLOUT, 0},
+		{ i2r[ROCE].rdma_events->fd, POLLIN|POLLOUT, 0},
+		{ i2r[INFINIBAND].comp_events->fd, POLLIN|POLLOUT, 0},
+		{ i2r[ROCE].comp_events->fd, POLLIN|POLLOUT,0},
+		{ i2r[INFINIBAND].context->async_fd, POLLIN|POLLOUT, 0},
+		{ i2r[ROCE].context->async_fd, POLLIN|POLLOUT, 0},
+	};
+	int events;
+	int i;
+
+	for(i = 0; i < 2; i++) {
+		/* Receive Buffers */
+		post_receive_buffers(i2r + i);
+		/* And request notifications if something happens */
+		ibv_req_notify_cq(i2r[i].cq, 0);
+	}
+
+loop:
+	events = poll(pfd, 6, timeout);
+
+	if (terminated)
+		goto out;
+
+	if (events < 0 && errno != EINTR) {
+		syslog(LOG_WARNING, "Poll failed with error=%d\n", errno);
+		goto out;
+	}
+	if (events <= 0) {	/* Timeout, or poll() interrupted by a signal */
+
+		/* Maintenance tasks */
+		if (nr_mc > active_mc) {
+			join_processing();
+			timeout = 1000;
+		} else {
+			/*
+			 * Gradually increase timeout
+			 * if nothing is really going
+			 * on
+			*/
+			if (!beacon && timeout < 60*60*100)
+				timeout *= 2;
+		}
+
+		status_write();
+
+		syslog(LOG_NOTICE, "ib2roce: %d/%d MC Active. Next Wakeup in %u ms.\n",
+			active_mc, nr_mc, timeout);
+
+		if (beacon)
+			beacon_send();
+
+		goto loop;
+	}
+
+	if (timeout > 5000)
+		timeout = 5000;
+
+	if (pfd[0].revents & (POLLIN|POLLOUT))
+		handle_rdma_event(INFINIBAND);
+
+	if (pfd[1].revents & (POLLIN|POLLOUT))
+		handle_rdma_event(ROCE);
+
+	if (pfd[2].revents & (POLLIN|POLLOUT))
+		handle_comp_event(INFINIBAND);
+
+	if (pfd[3].revents & (POLLIN|POLLOUT))
+		handle_comp_event(ROCE);
+
+	if (pfd[4].revents & (POLLIN|POLLOUT))
+		handle_async_event(INFINIBAND);
+
+	if (pfd[5].revents & (POLLIN|POLLOUT))
+		handle_async_event(ROCE);
+
+	goto loop;
+out:
+	return 0;
+}
+
+/*
+ * Daemon Management functions
+ */
+
+static void terminate(int x)
+{
+	terminated = true;
+}
+
+
+static void update_status(int x)
+{
+	update_requested = true;
+}
+
+static void daemonize(void)
+{
+	pid_t pid;
+
+	if (debug) {
+		openlog("ib2roce", LOG_PERROR, LOG_USER);
+		return;
+	}
+
+	pid = fork();
+
+	if (pid < 0)
+		exit(EXIT_FAILURE);
+
+	if (pid > 0)
+		exit(EXIT_SUCCESS);
+
+	if (setsid() < 0)
+	        exit(EXIT_FAILURE);
+
+	signal(SIGCHLD, SIG_IGN);
+	signal(SIGHUP, SIG_IGN);
+
+	pid = fork();
+
+	if (pid < 0)
+		exit(EXIT_FAILURE);
+
+	/* Terminate parent */
+	if (pid > 0)
+		exit(EXIT_SUCCESS);
+
+	/* Set new file permissions */
+	umask(0);
+
+	if (chdir("/var/lib/ib2roce")) {
+		perror("chdir");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Close all open file descriptors */
+	int x;
+	for (x = sysconf(_SC_OPEN_MAX); x>=0; x--)
+		close (x);
+
+	openlog ("ib2roce", LOG_PID, LOG_DAEMON);
+
+	signal(SIGINT, terminate);
+	signal(SIGTERM, terminate);
+	signal(SIGHUP, terminate);	/* Future: Reload a potential config file */
+	signal(SIGUSR1, update_status);
+}
+
+static int pid_fd;
+
+static void pid_open(void)
+{
+	struct flock fl = {
+		.l_type = F_WRLCK,
+		.l_whence = SEEK_SET,
+		.l_start = 0,
+		.l_len = 0
+	};
+	int n;
+	char buf[10];
+
+	pid_fd = open("ib2roce.pid", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
+
+	if (pid_fd < 0) {
+		syslog(LOG_CRIT, "Cannot open pidfile. Error %d\n", errno);
+		abort();
+	}
+
+	if (fcntl(pid_fd, F_SETLK, &fl) < 0) {
+		syslog(LOG_CRIT, "ib2roce already running.\n");
+		abort();
+	}
+
+	if (ftruncate(pid_fd, 0) < 0) {
+		syslog(LOG_CRIT, "Cannot truncate pidfile. Error %d\n", errno);
+		abort();
+	}
+	n = snprintf(buf, sizeof(buf), "%ld", (long) getpid());
+
+	if (write(pid_fd, buf, n) != n) {
+		syslog(LOG_CRIT, "Cannot write pidfile. Error %d\n", errno);
+		abort();
+	}
+}
+
+static void pid_close(void)
+{
+	unlink("ib2roce.pid");
+	close(pid_fd);
+}
+
+struct option opts[] = {
+	{ "device", required_argument, NULL, 'd' },
+	{ "roce", required_argument, NULL, 'r' },
+	{ "multicast", required_argument, NULL, 'm' },
+	{ "inbound", required_argument, NULL, 'i' },
+	{ "outbound", required_argument, NULL, 'o' },
+	{ "mgid", optional_argument, NULL, 'l' },
+	{ "beacon", no_argument, NULL, 'b' },
+	{ "debug", no_argument, NULL, 'x' },
+	{ "nobridge", no_argument, NULL, 'n' },
+	{ "port", required_argument, NULL, 'p' },
+	{ NULL, 0, NULL, 0 }
+};
+
+int main(int argc, char **argv)
+{
+	int op, ret = 0;
+	int n;
+
+	while ((op = getopt_long(argc, argv, "nbxl::i:r:m:o:d:p:",
+					opts, NULL)) != -1) {
+                switch (op) {
+		case 'd':
+			ib_name = optarg;
+			break;
+
+		case 'r':
+			roce_name = optarg;
+			break;
+
+		case 'm':
+			ret = new_mc_addr(optarg, false, false);
+			break;
+
+		case 'i':
+			ret = new_mc_addr(optarg, false, true);
+			break;
+
+		case 'o':
+			ret =  new_mc_addr(optarg, true, false);
+			break;
+
+		case 'l':
+			if (optarg) {
+				mgid_mode = find_mgid_mode(optarg);
+				if (mgid_mode)
+					break;
+			}
+			printf("List of supported MGID formats via -l<id>\n");
+			printf("=================================\n");
+			printf(" ID    | Signature | Port in MGID\n");
+			printf("-------+-----------+-------------\n");
+			for (n = 0; n < nr_mgid_signatures; n++) {
+				struct mgid_signature *m = mgid_signatures + n;
+
+				printf("%7s|    0x%x | %s\n",
+					m->id, m->signature, m->port_in_mgid ? "true" : "false");
+			}
+			exit(1);
+			break;
+
+		case 'x':
+			debug = true;
+			break;
+
+		case 'b':
+			beacon = true;
+			break;
+
+		case 'n':
+			bridging = false;
+			break;
+
+		case 'p':
+			default_port = atoi(optarg);
+			break;
+
+		default:
+			printf("%s " VERSION " Jan19,2021 (C) 2022 Christoph Lameter <cl@xxxxxxxxx>\n", argv[0]);
+			printf("Usage: ib2roce [<option>] ...\n");
+                        printf("-d|--device <if[:portnumber]>		Infiniband interface\n");
+                        printf("-r|--roce <if[:portnumber]>		ROCE interface\n");
+                        printf("-m|--multicast <multicast address>[:port][/mgidformat] (bidirectional)\n");
+                        printf("-i|--inbound <multicast address>	Incoming multicast only (ib traffic in, roce traffic out)\n");
+			printf("-o|--outbound <multicast address>	Outgoing multicast only / sendonly (ib traffic out, roce traffic in)\n");
+			printf("-l|--mgid				List available MGID formats for Infiniband\n");
+			printf("-l|--mgid <format>			Set default MGID format\n");
+			printf("-x|--debug				Do not daemonize, enter debug mode\n");
+			printf("-p|--port <number>			Set default port number\n");
+			printf("-b|--beacon				Send beacon every second\n");
+			printf("-n|--nobridge				Do everything but do not bridge packets\n");
+                        exit(1);
+		}
+	}
+
+	init_buf();
+
+	daemonize();
+	pid_open();
+
+	ret = find_rdma_devices();
+	if (ret)
+		return ret;
+
+	syslog (LOG_NOTICE, "ib2roce: Infiniband device = %s:%d, ROCE device = %s:%d. Multicast Groups=%d MGIDs=%s Buffers=%u\n",
+			i2r[INFINIBAND].context ? ibv_get_device_name(i2r[INFINIBAND].context->device) : "-",
+			i2r[INFINIBAND].port,
+			i2r[ROCE].context ? ibv_get_device_name(i2r[ROCE].context->device) : "-",
+			i2r[ROCE].port,
+			nr_mc,
+			mgid_mode ? mgid_mode->id : "Default",
+			nr_buffers);
+
+	setup_interface(INFINIBAND);
+	setup_interface(ROCE);
+
+	status_fd = open("ib2roce-status", O_CREAT | O_RDWR,  S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
+
+	if (beacon)
+		beacon_setup();
+
+	event_loop();
+
+	close(status_fd);
+
+	shutdown_roce();
+	shutdown_ib();
+
+	pid_close();
+	syslog (LOG_NOTICE, "ib2roce terminated.");
+	closelog();
+
+	return EXIT_SUCCESS;
+}
diff --git a/ib2roce/man/ib2roce.1 b/ib2roce/man/ib2roce.1
new file mode 100644
index 00000000..d6de7ef9
--- /dev/null
+++ b/ib2roce/man/ib2roce.1
@@ -0,0 +1,111 @@
+.\" Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md
+.\"
+.\" Copyright (C) 2022 Christoph Lameter <cl@xxxxxxxxx>
+.\"
+.TH "ib2roce" 1 "2022-01-20" "ib2roce" "ib2roce" ib2roce
+.SH NAME
+ib2roce \- Infiniband to ROCE gateway
+.SH SYNOPSIS
+.sp
+.nf
+\fIib2roce\fR [-d <infiniband device>] [-r <roce device>] [-m <multicast address>] [<other options>]
+.fi
+.SH "DESCRIPTION"
+ib2roce is a bridging tool that monitors and forwards RDMA packets
+between an Infiniband fabric and an Ethernet fabric. On the Ethernet
+side ROCE v2 is used for RDMA communication.
+
+ib2roce currently only supports multicast operations. It is planned
+to add unicast messaging. However, true RDMA operations are not considered
+necessary since the primary role of ib2roce is to support multicast
+messaging on RDMA fabrics that need unicast only as a recovery
+control channel. ib2roce is optimized to provide the highest
+throughput for multicast messaging.
+.SH "OPTIONS"
+.TP
+\-b, --beacon
+ib2roce will send its status every second on the multicast group
+239.1.2.3:999 on every interface and will log any beacons encountered
+from other instances of ib2roce. At some point ib2roce will gain
+the functionality to operate in a redundant setup and the beacons
+will be used to communicate between ib2roce instances.
+.TP
+\-d, --device  <infiniband-device[:port]>
+Specifies the Infiniband device to use with an optional port number.
+The port defaults to one if not specified.
+
+ib2roce will scan all devices on startup and will use the first
+Infiniband device that is suitable by default. If there are
+multiple Infiniband fabrics attached then this option is necessary
+to specify which one to use.
+.TP
+\-i, --inbound <ipv4 multicast address>[:port][/<mgid-format>]
+A multicast address that should be forwarded from Infiniband
+to ROCE. Traffic in the other direction is ignored.
+.TP
+\-l, --mgid [<format>]
+Specifies which MGID format to use on Infiniband. There are a
+couple of standards on how to encode IPv4 addresses in the
+MGID on Infiniband. This option selects the method to use.
+The right choice depends on the type of traffic
+that needs to be forwarded between the fabrics.
+
+If no format is specified then ib2roce will display the available
+formats. If no custom mgid format is set for a multicast group
+(the default) then ib2roce will
+perform a join using IPv4 conventions on Infiniband
+and rely on the RDMA subsystem to construct the MGID.
+
+The CLLM MGID format relies on the use of a port number to construct
+the MGID. This means that the port number needs to be correctly
+configured otherwise the MGID will not match and traffic will not
+flow to and from CLLM based applications. Other MGID encoding
+schemes do not include the port. However, the port number is still
+set while interacting with the RDMA subsystem. The number seems
+to be irrelevant at this time for RDMA subsystem functionality.
+.TP
+\-m, --multicast <ipv4 multicast address>[:port][/<mgid-format>]
+Specify a multicast address that should be forwarded in both directions.
+The port specification is optional since it is often not relevant for multicast
+traffic forwarding. The mgid format can be specified to override
+the global setting. See the -l option for more details.
+.TP
+\-n, --nobridge
+Run ib2roce without actually forwarding any packets. This is useful
+for debugging since the packet statistics and ib2roce logs will
+show useful information about multicast groups and the amount
+of traffic that may have to be forwarded.
+.TP
+\-o, --outbound <ipv4 multicast address>[:port][/<mgid-format>]
+A multicast address that should be forwarded from ROCE
+to Infiniband. Traffic in the other direction is ignored. Sendonly joins
+occur to avoid additional traffic on the fabric.
+.TP
+\-p, --port <number>
+Set the default port number. This is used when binding to ROCE and Infiniband
+devices and for multicast groups that do not have a port specified.
+Currently these port numbers seem to be ignored by the RDMA subsystem.
+However, the port number is relevant when a CLLM MGID needs to be
+constructed.
+.TP
+\-r, --roce  <roce-device[:port]>
+Specifies the ROCE Ethernet device to use with an optional port number.
+The port defaults to one if not specified.
+
+ib2roce will scan all devices on startup and will use the first
+Ethernet device that supports ROCE by default. If there are
+multiple ROCE fabrics attached then this option is necessary
+to specify which one to use.
+.TP
+\-x, --debug
+Switches ib2roce into debug mode. ib2roce does not daemonize but
+keeps logging diagnostic output to stderr.
+.SH "NOTES"
+ib2roce creates its files in the current directory when run in debug mode and in
+/var/lib/ib2roce when daemonized. These are the ib2roce.pid file, which is used to detect
+an already running ib2roce instance, and the ib2roce-status file, which is updated at
+regular intervals with ib2roce information.
+.SH "SEE ALSO"
+ib2roce(7)
+.SH AUTHORS
+Christoph Lameter <cl@xxxxxxxxx>
diff --git a/ib2roce/man/ib2roce.7 b/ib2roce/man/ib2roce.7
new file mode 100644
index 00000000..b8738ae2
--- /dev/null
+++ b/ib2roce/man/ib2roce.7
@@ -0,0 +1,24 @@
+.\" Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md
+.TH "IB2ROCE" 7 "2021-11-15" "ib2roce" "IB2ROCE User Guide" IB2ROCE
+.SH NAME
+ib2roce \- Bridge RDMA traffic between Infiniband and ROCE
+.SH "DESCRIPTION"
+ib2roce is used to bridge traffic between Infiniband nodes and Ethernet nodes using ROCE to
+enable RDMA native communication between devices on Infiniband with devices on ROCE.
+
+At this point only multicast bridging is supported. The bridge will simply
+replicate multicast traffic between the two fabrics by joining the multicast
+group on either interface and forwarding it to the other.
+
+ib2roce prevents multicast loopback through two measures. Multicast
+loopback is switched off, and a filter on incoming traffic discards
+packets that originate from the local host. The filter covers the cases
+where switching off loopback is not sufficient, either because that
+feature is not supported or because local clients need loopback.
+.SH "NOTES"
+The Infiniband to ROCE bridge does not support true RDMA memory transfers
+but only messaging in UDP / UD frames. The design covers both unicast
+and multicast messaging. However, as of January 2022 only multicast
+is implemented.
+.SH "SEE ALSO"
+ib2roce(1)

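For readers following the MGID checks in recv_buf(), here is a minimal standalone sketch of the byte layout that function relies on: byte 0 must be 0xff for multicast, bytes 2-3 carry the signature, bytes 10-11 the optional port, and bytes 12-15 the IPv4 group address. The struct mgid_fields type and the mgid_decode() name are made up for illustration and are not part of the patch. Copying with memcpy() instead of casting is a choice of this sketch to avoid unaligned reads.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical helper type for this sketch, not part of the patch. */
struct mgid_fields {
	uint16_t signature;	/* bytes 2..3, converted to host byte order */
	uint16_t port;		/* bytes 10..11, converted to host byte order */
	struct in_addr addr;	/* bytes 12..15, kept in network byte order */
};

/*
 * Decode a 16 byte MGID using the offsets recv_buf() uses.
 * Returns 0 on success and -1 if byte 0 does not mark the GID
 * as multicast.
 */
static int mgid_decode(const unsigned char mgid[16], struct mgid_fields *f)
{
	uint16_t v;

	assert(mgid && f);

	if (mgid[0] != 0xff)
		return -1;

	memcpy(&v, mgid + 2, sizeof(v));
	f->signature = ntohs(v);

	memcpy(&v, mgid + 10, sizeof(v));
	f->port = ntohs(v);

	memcpy(&f->addr, mgid + 12, sizeof(f->addr));
	return 0;
}
```

Whether the port field is meaningful depends on the MGID format in use. The patch only reads it when port_in_mgid is set for the selected format.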

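The wakeup timeout handling in event_loop() can be summarized as a pure function. This is only a sketch: next_timeout(), its parameters, and the two macro names are hypothetical, while the constants are copied from the patch as posted.

```c
#include <assert.h>

#define IDLE_CEILING (60u * 60u * 100u)	/* doubling stops at or above this, in ms */
#define BUSY_CEILING 5000u		/* clamp applied once traffic is seen, in ms */

/*
 * timeout:       current poll timeout in ms
 * events:        return value of poll(): 0 for a timeout, > 0 for activity
 * pending_joins: nonzero while nr_mc > active_mc
 * beacon:        nonzero if beacons are enabled
 */
static unsigned next_timeout(unsigned timeout, int events,
			     int pending_joins, int beacon)
{
	assert(events >= 0);

	if (events == 0) {
		/* Idle tick: retry joins quickly, otherwise back off */
		if (pending_joins)
			return 1000;
		if (!beacon && timeout < IDLE_CEILING)
			return timeout * 2;
		return timeout;
	}

	/* Activity: do not sleep longer than BUSY_CEILING */
	return timeout > BUSY_CEILING ? BUSY_CEILING : timeout;
}
```

With these constants, repeated idle ticks starting from the initial 1000 ms settle at 512000 ms, and after a clamp to 5000 ms they settle at 640000 ms. With beacons enabled the timeout is never doubled.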

