On Tue, 2016-06-07 at 12:55 +0200, Christoph Hellwig wrote: > There is absolutely no point in dragging in an overcomplicated configfs > structure for a very simple protocol which also is very different from > SCSI in it's nitty gritty details. Please be more specific wrt the two individual points that have been raised. > Keeping the nvme target self contains > allows it to be both much simpler and much easier to understand, as well > as much better testable - see the amount of test coverage we could easily > add for example. I disagree. > > Or to put it the other way around - if there was any major synergy in > reusing the SCSI target code that just shows we're missing functionality > in the block layer or configfs. > To reiterate the points again. *) Extensible to multiple types of backend drivers. nvme-target needs a way to absorb new backend drivers, that does not effect existing configfs group layout or attributes. Looking at the nvmet/configfs layout as-is, there are no multiple backend types defined, nor a way to control backend feature bits exposed to nvme namespaces at runtime. What is being proposed is a way to share target-core backends via existing configfs symlinks across SCSI and NVMe targets. Which means: - All I/O state + memory submission is done at RCU protected se_device level via sbc_ops - percpu reference counting is done outside of target-core - Absorb all nvmet/io-cmd optimizations into target_core_iblock.c - Base starting point for features in SCSI + NVMe that span across multiple endpoints and instances (reservations + APTPL, multipath, copy-offload across fabric types) Using target-core backends means we get features like T10-PI and sbc_ops->write_same for free that don't exist in nvmet, and can utilize a common set of backend drivers for SCSI and NVMe via an existing configfs ABI and python userspace community. And to the second, and more important point for defining a configfs ABI that works for both today's requirements, as well into the 2020s without breaking user-space compatibility. As-is, the initial design using top level nvmet configfs symlinks of subsystem groups into individual port + host groups does not scale. That is, it currently does: - Sequential list lookup under global rw_mutex of top-level nvmet_port and nvmet_host symlink ->allow_link() and ->drop_link() configfs callbacks. - nvmet_fabrics_ops->add_port() callback invoked under same global rw mutex. This is very bad for several reasons. As-is, this blocks all other configfs port + host operations from occurring even during normal operation, which makes it quite useless for any type of multi-tenant target environment where the individual target endpoints *must* be able to operate independently. Seriously, there is never a good reason why configfs group or item callbacks should be performing list lookup under a global lock at this level. Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..? A simple example where this design breaks down quickly is a NVMf ops->add_port() call that requires a HW reset, or say reloading of firmware that can take multiple seconds. (qla2xxx comes to mind). There is a simple test to highlight this limitation. Take any nvme-target driver that is capable of multiple ports, and introduce a sleep(5) into each ops->add_port() call. Now create 256 different subsystem NQNs with 256 different ports across four different user-space processes. What happens to other subsystems, ports and host groups configfs symlinks when this occurs..? What happens to the other user-space processes..? -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html