On 4/28/2015 8:43 PM, Jason Gunthorpe wrote:
On Tue, Apr 28, 2015 at 05:03:11PM +0300, Matan Barak wrote:
The cleanup of roce_gid_cache is done in a different context, so we
need to make sure the device is still alive while doing so.
This explanation doesn't look right to me. I don't see anything like
that under roce_gid_cache_cleanup_one ?
Although, there must be a call to the driver's modify_gid to free
context before freeing, and I don't see that obviously happening..
Of course there is:
roce_gid_cache_client_cleanup_one ->
roce_gid_cache_client_cleanup_one_work -> *work* ->
roce_gid_cache_client_cleanup_work_handler -> roce_gid_cache_cleanup_one
-> free_roce_gid_cache -> write_gid -> modify_gid
However, all the other async work launched doesn't look safe at all.
I think it's safe - roce_gid_cache_client_cleanup_one deactivates the
cache (no new events are handled on the cache). Ongoing events are
flushed in roce_gid_cache_client_cleanup_one_work
(flush_workqueue(roce_gid_mgmt_wq)). When roce_gid_cache_cleanup_one is
called - no work will be done or is currently in process on this cache.
So, did you mean that the device must still be alive while all the
other work is running? And the point of this scheme is to guarentee
all the work is flushed? (at least I hope it is, otherwise there are
bigger problems here)
It's not safe to free the client's context in ib_unregister_device while
the client isn't done. The obvious solution is to wait in client->remove
(like you suggested) until the client has finished cleaning up things.
This doesn't fit our case - since client->remove is called under
device_lock, but it's possible (for example) that
roce_rescan_device_work_handler is currently running and waits to grab
this exact mutex - DEAD LOCK.
Freeing thing asynchronously is a possible solution - we don't lock all
IB devices but just our device. Moreover, it's also more future proof as
other clients might want to run other things concurrently.
It is just fundamentally wrong to return from ib_client.remove while
async work is still outstanding, the client is expected to deal with
this internally and synchronously.
You don't need IB core help to do this.
netdev is taking a similar approach - please take a look at
netdev_wait_allrefs
Or: This should have been fixed after Haggai brought it up...
Jason
Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html