Hi HCH & Co,

On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
> This patch set adds a generic NVMe over Fabrics target. The
> implementation conforms to the NVMe 1.2b specification (which
> includes Fabrics) and provides the NVMe over Fabrics access
> to Linux block devices.
>

Thanks for all of the development work by the fabric_linux_driver team
(HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year.

Very excited to see this code get a public release now that the NVMf
specification is out. Now that it's in the wild, it's a good
opportunity to discuss some of the more interesting implementation
details, beyond the new NVMf wire-protocol itself.

(Adding target-devel + linux-scsi CC')

> The target implementation consists of several elements:
>
> - NVMe target core: defines and manages the NVMe entities (subsystems,
>   controllers, namespaces, ...) and their allocation, responsible
>   for initial commands processing and correct orchestration of
>   the stack setup and tear down.
>
> - NVMe admin command implementation: responsible for parsing and
>   servicing admin commands such as controller identify, set features,
>   keep-alive, log page, ...).
>
> - NVMe I/O command implementation: responsible for performing the actual
>   I/O (Read, Write, Flush, Deallocate (aka Discard). It is a very thin
>   layer on top of the block layer and implements no logic of it's own.
>   To support exporting file systems please use the loopback block driver
>   in direct I/O mode, which gives very good performance.
>
> - NVMe over Fabrics support: responsible for servicing Fabrics commands
>   (connect, property get/set).
>
> - NVMe over Fabrics discovery service: responsible to serve the Discovery
>   log page through a special cut down Discovery controller.
>
> The target is configured using configfs, and configurable entities are:
>
> - NVMe subsystems and namespaces
> - NVMe over Fabrics ports and referrals
> - Host ACLs for primitive access control - NVMe over Fabrics access
>   control is still work in progress at the specification level and
>   will be implemented once that work has finished.
>
> To configure the target use the nvmetcli tool from
> http://git.infradead.org/users/hch/nvmetcli.git, which includes detailed
> setup documentation.
>
> In addition to the Fabrics target implementation we provide a loopback
> driver which also conforms the NVMe over Fabrics specification and allows
> evaluation of the target stack with local access without requiring a real
> fabric.
>

So as-is, I have two main objections that have been discussed off-list
for some time, and that won't be a big surprise to anyone following the
fabrics_linux_driver list. ;P

First topic: I think nvme-target namespaces should be utilizing existing
configfs logic, and sharing /sys/kernel/config/target/core/ backend
driver symlinks as individual nvme-target subsystem namespaces.

That is, we've already got a configfs ABI in place for target mode
back-ends that today is able to operate independently from SCSI
architecture model dependencies.

To that end, the prerequisite series to allow target-core backends to
operate independent of se_cmd, and allow se_device backends to be
configfs symlinked directly into /sys/kernel/config/nvmet/, outside
of /sys/kernel/config/target/$FABRIC/, has been posted earlier here:

http://marc.info/?l=linux-scsi&m=146527281416606&w=2

Note the -v2 series has absorbed the nvmet/io-cmd execute_rw()
improvements from Sagi + Ming (inline bio/bvec and blk_poll) into
target_core_iblock.c driver code.

Second topic, and more important from a kernel ABI perspective, are the
current scale limitations around the first pass of nvmet configfs.c
layout code in /sys/kernel/config/nvmet/.

Namely, the design of having three top-level configfs groups in
/sys/kernel/config/nvmet/[subsystems,ports,hosts] that are configfs
symlinked between each other, with a single rw_mutex (nvmet_config_sem)
used for global list lookup and for enforcing a globally synchronized
nvmet_fabrics_ops->add_port() creation across all subsystem NQN ports.

From the shared experience in target_core_fabric_configfs.c over the
last 8 years, perhaps the greatest strength of configfs has been its
ability to allow config_item_type parent/child relationships to exist
and operate independently of one another.

Specifically in the context of storage tenants, this means creation +
deletion of one backend + target fabric endpoint tenant should not
block creation + deletion of another backend + target fabric endpoint
tenant.

As-is, a nvmet configfs layout holding a global mutex across
subsystem/port/host creation + deletion, and doing internal list lookup
within configfs ->allow_link + ->drop_link callbacks, ends up being
severely limiting when scaling up the total number of nvmet subsystem
NQNs and ports.

Specifically, modern deployments of /sys/kernel/config/target/iscsi/
expect backends + fabric endpoints to be configured in parallel at
< 100ms from user-space, in order to actively migrate and fail-over
100s of storage instances (eg: iscsi IQNs -> NVMf NQN) across physical
cluster nodes and L3 networks.

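To make the serialization concern concrete, here is a toy user-space
model in Python. It is not kernel code and does not touch configfs: the
tenant names, the SETUP_SECONDS cost and the configure_all() helper are
all made up for illustration, and a plain exclusive lock stands in for
the write side of nvmet_config_sem. It only shows that N tenants
configured in parallel take roughly N times as long once every one of
them has to pass through a single global lock.

```python
import threading
import time

# Hypothetical per-tenant setup cost; stands in for add_port() and friends.
SETUP_SECONDS = 0.05


def configure_all(tenants, lock_for):
    """Configure every tenant in its own thread; return wall-clock seconds.

    lock_for(name) returns the lock a tenant must hold while being set up.
    """
    def configure(name):
        with lock_for(name):
            time.sleep(SETUP_SECONDS)

    threads = [threading.Thread(target=configure, args=(t,)) for t in tenants]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start


# Made-up tenant names, one per subsystem NQN.
tenants = [f"nqn.example:tenant{i}" for i in range(8)]

# Model 1: one global lock shared by every tenant.
global_lock = threading.Lock()
serialized = configure_all(tenants, lambda name: global_lock)

# Model 2: each tenant has its own lock, so setups do not block each other.
per_tenant = {t: threading.Lock() for t in tenants}
independent = configure_all(tenants, lambda name: per_tenant[name])

print(f"global lock:      {serialized:.3f}s")
print(f"per-tenant locks: {independent:.3f}s")
```

With the global lock the total time grows linearly with the number of
tenants, while with per-tenant locks it stays close to the cost of a
single setup. Real numbers depend on what add_port() and the
link/unlink callbacks actually do, so treat this as a sketch only.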
So in order to reach this level of scale with nvmet/configfs, the layout
I think is necessary to match iscsi-target in a multi-tenant environment
will, in its most basic form, look like:

/sys/kernel/config/nvmet/subsystems/
└── nqn.2003-01.org.linux-iscsi.NVMf.skylake-ep
    ├── hosts
    ├── namespaces
    │   └── ns_1
    │       └── 1 -> ../../../../../../target/core/rd_mcp_1/ramdisk0
    └── ports
        ├── pcie:$SUPER_TURBO_FABRIC_EAST
        ├── pcie:$SUPER_TURBO_FABRIC_WEST
        ├── rdma:[$IPV6_ADDR]:$PORT
        ├── rdma:10.10.1.75:$PORT
        └── loop

That is, both NQN ports groups and host ACL groups exist below the
nvmet_subsys->group, and NQN namespaces are configfs symlinked directly
from /sys/kernel/config/target/core/ backends as mentioned in point #1.

To that end, I'll be posting a nvmet series shortly that implements
a multi-tenant configfs layout WIP using nvme/loop, using existing
target-core backends as configfs symlinked nvme namespaces.

Comments..?