Re: [PATCH] nvme: Revert: Fix controller creation races with teardown flow

Sagi Grimberg <sagi@xxxxxxxxxxx> · Fri, 28 Aug 2020 16:59:45 -0700

This is indeed a regression.

Perhaps we should also revert:
12a0b6622107 ("nvme: don't hold nvmf_transports_rwsem for more than 
transport lookups")

Which inherently caused this by removing the serialization of
.create_ctrl()...

No, I believe the semaphore patch is correct. Otherwise things can be
blocked for a long time: a minute (one command timeout), or even multiple
minutes in the case where a command failure in the core layers effectively
gets ignored and thus doesn't trigger the error path in the transport.
There can also be multiple /dev/nvme-fabrics commands stacked up, which
can make the delays look much longer to the last guy.

As far as creation vs. teardown... yeah, not fun, but there are other
ways to deal with it. FC: I got rid of the separate create/reconnect
threads a while ago, hence the return-control-while-reconnecting behavior,
so I've had to deal with it. It's one area where it'd be nice to see some
convergence in implementation again between transports.

Doesn't fc have a bug there? In create_ctrl, after flushing the
connect_work, what tells it whether delete is running concurrently
with it (or has already run)?
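
To make the question concrete, here is a small user-space sketch of one
possible answer: after the flush, the create path re-checks the
controller state under a lock and bails out if delete got there first.
This is only an illustration under assumed names (struct ctrl,
delete_ctrl, create_after_flush); it is not a claim about what the FC
transport does.

```c
#include <pthread.h>

/* Hypothetical, simplified controller states. */
enum ctrl_state { CTRL_CONNECTING, CTRL_LIVE, CTRL_DELETING };

struct ctrl {
	pthread_mutex_t lock;
	enum ctrl_state state;
};

/* Teardown path: mark the controller as going away. */
static void delete_ctrl(struct ctrl *c)
{
	pthread_mutex_lock(&c->lock);
	c->state = CTRL_DELETING;
	pthread_mutex_unlock(&c->lock);
}

/*
 * Create path, called once the connect work has been flushed.
 * Returns 0 if the controller went live, -1 if delete won the race.
 */
static int create_after_flush(struct ctrl *c)
{
	int ret = 0;

	pthread_mutex_lock(&c->lock);
	if (c->state == CTRL_DELETING)
		ret = -1;	/* delete is running or already ran */
	else
		c->state = CTRL_LIVE;
	pthread_mutex_unlock(&c->lock);
	return ret;
}
```

A state check alone does not settle object lifetime: if delete may
already have freed the controller, the create path also needs its own
reference before it can look at the state safely.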