On 12/6/2018 10:54 AM, Ruhl, Michael J wrote: >> -----Original Message----- >> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma- >> owner@xxxxxxxxxxxxxxx] On Behalf Of Steve Wise >> Sent: Thursday, December 6, 2018 10:44 AM >> To: dledford@xxxxxxxxxx; jgg@xxxxxxxxxxxx >> Cc: linux-rdma@xxxxxxxxxxxxxxx; BMT@xxxxxxxxxxxxxx; leon@xxxxxxxxxx; >> markb@xxxxxxxxxxxx; yanjun.zhu@xxxxxxxxxx >> Subject: Re: [PATCH v6 3/4] RDMA/Core: add >> RDMA_NLDEV_CMD_NEWLINK/DELLINK support >> >> Commenting on my own submission: >> >> On 12/5/2018 9:14 AM, Steve Wise wrote: >>> Add support for new LINK messages to allow adding and deleting rdma >>> interfaces. This will be used initially for soft rdma drivers which >>> instantiate device instances dynamically by the admin specifying a >>> netdev device to use. The rdma_rxe module will be the first user of >>> these messages. >>> >>> The design is modeled after RTNL_NEWLINK/DELLINK: rdma drivers >>> register with the rdma core if they provide link add/delete functions. >>> Each driver registers with a unique "type" string, that is used to >>> dispatch messages coming from user space. A new RDMA_NLDEV_ATTR >>> is defined for the "type" string. User mode will pass 3 attributes >>> in a NEWLINK message: RDMA_NLDEV_ATTR_DEV_NAME for the desired >> rdma >>> device name to be created, RDMA_NLDEV_ATTR_LINK_TYPE for the "type" >>> of link being added, and RDMA_NLDEV_ATTR_NDEV_NAME for the >> net_device >>> interface to use for this link. The DELLINK message will contain the >>> RDMA_NLDEV_ATTR_DEV_INDEX of the device to delete. >>> >>> Signed-off-by: Steve Wise <swise@xxxxxxxxxxxxxxxxxxxxx> >>> Reviewed-by: Leon Romanovsky <leonro@xxxxxxxxxxxx> >>> --- >>> drivers/infiniband/core/nldev.c | 134 >> +++++++++++++++++++++++++++++++++++++++ >>> include/rdma/ib_verbs.h | 2 + >>> include/rdma/rdma_netlink.h | 13 ++++ >>> include/uapi/rdma/rdma_netlink.h | 11 +++- >>> 4 files changed, 158 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/infiniband/core/nldev.c >> b/drivers/infiniband/core/nldev.c >>> index 9abbadb9e366..b6d97c592074 100644 >>> --- a/drivers/infiniband/core/nldev.c >>> +++ b/drivers/infiniband/core/nldev.c >>> @@ -33,6 +33,7 @@ >>> #include <linux/module.h> >>> #include <linux/pid.h> >>> #include <linux/pid_namespace.h> >>> +#include <linux/mutex.h> >>> #include <net/netlink.h> >>> #include <rdma/rdma_cm.h> >>> #include <rdma/rdma_netlink.h> >>> @@ -107,6 +108,8 @@ >>> [RDMA_NLDEV_ATTR_DRIVER_U32] = { .type = NLA_U32 }, >>> [RDMA_NLDEV_ATTR_DRIVER_S64] = { .type = NLA_S64 }, >>> [RDMA_NLDEV_ATTR_DRIVER_U64] = { .type = NLA_U64 }, >>> + [RDMA_NLDEV_ATTR_LINK_TYPE] = { .type = >> NLA_NUL_STRING, >>> + .len = RDMA_NLDEV_ATTR_ENTRY_STRLEN >> }, >>> }; >>> >>> static int put_driver_name_print_type(struct sk_buff *msg, const char >> *name, >>> @@ -1104,6 +1107,129 @@ static int nldev_res_get_pd_dumpit(struct >> sk_buff *skb, >>> return res_get_common_dumpit(skb, cb, RDMA_RESTRACK_PD); >>> } >>> >>> +static LIST_HEAD(link_ops); >>> +static DECLARE_RWSEM(link_ops_rwsem); >>> + >>> +static const struct rdma_link_ops *link_ops_get(const char *type) >>> +{ >>> + const struct rdma_link_ops *ops; >>> + >>> + list_for_each_entry(ops, &link_ops, list) { >>> + if (!strcmp(ops->type, type)) >>> + goto out; >>> + } >>> + ops = NULL; >>> +out: >>> + return ops; >>> +} >>> + >>> +void rdma_link_register(struct rdma_link_ops *ops) >>> +{ >>> + down_write(&link_ops_rwsem); >>> + if (link_ops_get(ops->type)) { >>> + WARN_ONCE("Duplicate rdma_link_ops! %s\n", ops->type); >>> + goto out; >>> + } >>> + list_add(&ops->list, &link_ops); >>> +out: >>> + up_write(&link_ops_rwsem); >>> +} >>> +EXPORT_SYMBOL(rdma_link_register); >>> + >>> +void rdma_link_unregister(struct rdma_link_ops *ops) >>> +{ >>> + down_write(&link_ops_rwsem); >>> + list_del(&ops->list); >>> + up_write(&link_ops_rwsem); >>> +} >>> +EXPORT_SYMBOL(rdma_link_unregister); >>> + >>> +static int nldev_newlink(struct sk_buff *skb, struct nlmsghdr *nlh, >>> + struct netlink_ext_ack *extack) >>> +{ >>> + struct nlattr *tb[RDMA_NLDEV_ATTR_MAX]; >>> + char ibdev_name[IB_DEVICE_NAME_MAX]; >>> + const struct rdma_link_ops *ops; >>> + struct ib_device *device; >>> + char ndev_name[IFNAMSIZ]; >>> + char type[IFNAMSIZ]; >>> + int err; >>> + >>> + err = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1, >>> + nldev_policy, extack); >>> + if (err || !tb[RDMA_NLDEV_ATTR_DEV_NAME] || >>> + !tb[RDMA_NLDEV_ATTR_LINK_TYPE] || >> !tb[RDMA_NLDEV_ATTR_NDEV_NAME]) >>> + return -EINVAL; >>> + >>> + nla_strlcpy(ibdev_name, tb[RDMA_NLDEV_ATTR_DEV_NAME], >>> + sizeof(ibdev_name)); >>> + if (strchr(ibdev_name, '%')) >>> + return -EINVAL; >>> + >>> + nla_strlcpy(type, tb[RDMA_NLDEV_ATTR_LINK_TYPE], sizeof(type)); >>> + nla_strlcpy(ndev_name, tb[RDMA_NLDEV_ATTR_NDEV_NAME], >>> + sizeof(ndev_name)); >>> + >>> + down_read(&link_ops_rwsem); >>> + ops = link_ops_get(type); >>> +#ifdef CONFIG_MODULES >>> + if (!ops) { >>> + up_read(&link_ops_rwsem); >>> + request_module("rdma-link-%s", type); >>> + down_read(&link_ops_rwsem); >>> + ops = link_ops_get(type); >>> + } >>> +#endif >>> + if (ops) { >>> + device = ops->newlink(ibdev_name, ndev_name); >>> + if (IS_ERR(device)) >>> + err = PTR_ERR(device); >>> + else >>> + device->link_ops = ops; >>> + } else { >>> + err = -ENODEV; >>> + } >>> + up_read(&link_ops_rwsem); >>> + >>> + return err; >>> +} >>> + >>> +static int nldev_dellink(struct sk_buff *skb, struct nlmsghdr *nlh, >>> + struct netlink_ext_ack *extack) >>> +{ >>> + struct nlattr *tb[RDMA_NLDEV_ATTR_MAX]; >>> + const struct rdma_link_ops *ops; >>> + struct device *dma_device; >>> + struct ib_device *device; >>> + u32 index; >>> + int err; >>> + >>> + err = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1, >>> + nldev_policy, extack); >>> + if (err || !tb[RDMA_NLDEV_ATTR_DEV_INDEX]) >>> + return -EINVAL; >>> + >>> + index = nla_get_u32(tb[RDMA_NLDEV_ATTR_DEV_INDEX]); >>> + device = ib_device_get_by_index(index); >>> + if (!device) >>> + return -EINVAL; >>> + >>> + ops = device->link_ops; >>> + >>> + /* >>> + * Deref the ib_device before deleting it. Otherwise we >>> + * deadlock unregistering the device. Hold a ref on the >>> + * underlying dma_device though to keep the memory around >>> + * until we're done. >>> + */ >>> + dma_device = get_device(device->dma_device); >>> + ib_device_put(device); >>> + err = ops ? ops->dellink(device) : -ENODEV; >>> + put_device(dma_device); >>> + >>> + return err; >>> +} >>> + >> >> There is still a problem with the above code. If thread 1 is running >> this code and gets up to just after ib_device_put() but before calling >> ops->dellink, and then thread2 runs the rmmod code of rxe, thread 2 can >> remove the device and complete the module unload. Then thread 1 >> continues and calls the ops->dellink function which has been unloaded. >> >> I think the above code needs a module reference before the >> ib_device_put(), and a module dereference after calling ops->dellink. >> >> Thoughts? > Would using the rwsem around this accomplish that? That wouldn't help, unless nldev_dellink() looked up the link_ops with the rwsem held, instead of having it cached in the ib_device struct. Which is sort of what I had in an earlier version. But since nldev_dellink() found the ib_device via ib_device_get_by_index(), we know the rxe module cannot unload until ib_device_put() is called by nldev_dellink(). I don't know how to stall the unload until nldev_dellink() finishes other than a module reference.