On Mon, May 14, 2018 at 11:11:11AM +0300, Leon Romanovsky wrote:
> From: Parav Pandit <parav@xxxxxxxxxxxx>
>
> Currently a GID entry can be deleted from the GID table while it is
> in use by ULPs, CM modules or user space applications.
> It may even get replaced with new attributes and/or GID value which
> may lead to undesired traffic in the network.
> Previous and new GID entry might be in two different net namespace; this
> may lead to send traffic to undesired namespace.
> Provider drivers might be working on GID entry while modifying QP or
> creating address handle, and GID entry can get modified.
>
> So this patch introduces GID reference count infrastructure without
> actually exposing GID get/put/hold APIs to overcome above issues.
>
> GID entry reference count is based on standard kref structure.
> A new state GID_TABLE_ENTRY_PENDING_DEL indicates that a given GID
> entry is marked as pending for deletion.
>
> While it is in pending state, allocator must not reuse such entry.
> When GID entry is in PENDING_DEL state, no more new references can
> be taken using find operation. This ensures that no new processing
> starts for such expiring GID entry.
>
> GID find APIs ignore the entries which are marked for deletion.
>
> Signed-off-by: Parav Pandit <parav@xxxxxxxxxxxx>
> Signed-off-by: Leon Romanovsky <leonro@xxxxxxxxxxxx>
>  drivers/infiniband/core/cache.c | 96 ++++++++++++++++++++++++++++++++++-------
>  1 file changed, 80 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
> index 47fa72b434be..f09966b907af 100644
> +++ b/drivers/infiniband/core/cache.c
> @@ -73,9 +73,18 @@ enum gid_table_entry_props {
>  enum gid_table_entry_state {
>          GID_TABLE_ENTRY_FREE = 1,
>          GID_TABLE_ENTRY_ALLOCATED = 2,
> +        /*
> +         * Indicates that entry is pending to be removed, there may
> +         * be active users of this GID entry.
> +         * When last user of the GID entry releases reference to it,
> +         * GID state moves to GID_TABLE_ENTRY_FREE.
> +         */
> +        GID_TABLE_ENTRY_PENDING_DEL = 3,
>  };
>
>  struct ib_gid_table_entry {
> +        struct kref                     kref;
> +        struct work_struct              del_work;
>          enum gid_table_entry_props      props;
>          enum gid_table_entry_state      state;
>          union ib_gid                    gid;
> @@ -166,17 +175,70 @@ is_gid_table_entry_valid(const struct ib_gid_table_entry *entry)
>          return entry->state == GID_TABLE_ENTRY_ALLOCATED;
>  }
>
> -static void del_roce_gid(struct ib_device *device, u8 port_num,
> -                         struct ib_gid_table *table, int ix)
> +static void clear_gid_entry(struct ib_gid_table *table,
> +                            struct ib_gid_table_entry *entry)
>  {
> +        write_lock_irq(&table->rwlock);
> +        entry->state = GID_TABLE_ENTRY_FREE;
> +        write_unlock_irq(&table->rwlock);
> +
> +        memcpy(&entry->gid, &zgid, sizeof(zgid));
> +        memset(&entry->attr, 0, sizeof(entry->attr));
> +        entry->context = NULL;
> +}

This reads in the wrong order: once the entry is marked FREE, something
could move it to ALLOCATED and race with the memset here.

However, since the table->lock mutex has to be held when calling this
function and during other writes to the gid, that shouldn't actually be
possible. I checked and this seems to be true, at least. Still worth
re-ordering for clarity.

Also, why memcpy zgid instead of memset? zgid is only really useful for
memcmp, and arguably we should just have an is_zero_gid() function
instead.
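Something like this rough sketch is what I have in mind (illustrative
only, not tested against this series; is_zero_gid() is just a name I
made up, and the caller is assumed to hold table->lock as noted above):

/* zgid is only ever compared against, so a predicate reads better
 * than open-coded memcmp()/memcpy() at the call sites.
 */
static bool is_zero_gid(const union ib_gid *gid)
{
        return !memcmp(gid, &zgid, sizeof(*gid));
}

static void clear_gid_entry(struct ib_gid_table *table,
                            struct ib_gid_table_entry *entry)
{
        /* Scrub the contents first; the entry is still PENDING_DEL,
         * so find_gid() cannot return it and the allocator cannot
         * reuse it while this runs.
         */
        memset(&entry->gid, 0, sizeof(entry->gid));
        memset(&entry->attr, 0, sizeof(entry->attr));
        entry->context = NULL;

        /* Only a fully scrubbed entry becomes FREE and eligible
         * for reuse.
         */
        write_lock_irq(&table->rwlock);
        entry->state = GID_TABLE_ENTRY_FREE;
        write_unlock_irq(&table->rwlock);
}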
>                  /*
>                   * Additionally find_gid() is used to find valid entry during
> -                 * lookup operation, where validity needs to be checked. So
> -                 * find the empty entry first to continue to search for a free
> -                 * slot and ignore its INVALID flag.
> +                 * lookup operation; so ignore the entries which are marked as
> +                 * pending for removal and the entry which is marked as invalid.
>                   */
>                  if (!is_gid_table_entry_valid(data))
>                          continue;

Nit: 'and the entries which are marked'

> @@ -494,7 +552,8 @@ int ib_cache_gid_del_all_netdev_gids(struct ib_device *ib_dev, u8 port,
>          mutex_lock(&table->lock);
>
>          for (ix = 0; ix < table->sz; ix++) {
> -                if (table->data_vec[ix].attr.ndev == ndev) {
> +                if (is_gid_table_entry_valid(&table->data_vec[ix]) &&
> +                    table->data_vec[ix].attr.ndev == ndev) {
>                          del_gid(ib_dev, port, table, ix);
>                          deleted = true;
>                  }
> @@ -741,8 +800,7 @@ static void cleanup_gid_table_port(struct ib_device *ib_dev, u8 port,
>
>          mutex_lock(&table->lock);
>          for (i = 0; i < table->sz; ++i) {
> -                if (memcmp(&table->data_vec[i].gid, &zgid,
> -                           sizeof(table->data_vec[i].gid))) {
> +                if (is_gid_table_entry_valid(&table->data_vec[i])) {
>                          del_gid(ib_dev, port, table, i);
>                          deleted = true;
>                  }
> @@ -1297,6 +1355,12 @@ void ib_cache_cleanup_one(struct ib_device *device)
>          ib_unregister_event_handler(&device->cache.event_handler);
>          flush_workqueue(ib_wq);
>          gid_table_cleanup_one(device);
> +
> +        /*
> +         * Flush the wq second time for any pending GID delete
> +         * work for RoCE device.
> +         */
> +        flush_workqueue(ib_wq);
>  }

Hmm.. either this flush is not needed, or it isn't strong enough. I
think we need to wait until the GID table is fully freed and all
PENDING_DEL entries have completed? Or, if we don't need that, then
there is nothing to wait for and the flush can go away.

Jason
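P.S. If the answer is that we do need to wait, I'm picturing something
along these lines instead of the second flush_workqueue() -- a sketch
only, with a made-up helper name, and it assumes (as in this patch)
that entries move from PENDING_DEL to FREE only via their del_work:

/* Called after gid_table_cleanup_one(), when no new GIDs can be
 * added, so no entry can newly enter PENDING_DEL.  flush_work()
 * waits for a queued or running del_work to finish, which guarantees
 * every PENDING_DEL entry has reached FREE before the table memory
 * is released.  A blanket flush_workqueue(ib_wq) cannot promise that
 * for delete work queued after the flush has started.
 */
static void gid_table_wait_pending_del(struct ib_gid_table *table)
{
        int i;

        for (i = 0; i < table->sz; i++) {
                struct ib_gid_table_entry *entry = &table->data_vec[i];

                if (entry->state == GID_TABLE_ENTRY_PENDING_DEL)
                        flush_work(&entry->del_work);
        }
}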