(Refer to diagram in spdk_drbd.pdf)

tcmu-runner block storage handlers running under SPDK
=====================================================
A prototype of a new block device module "bdev_tcmur" running under the Storage
Performance Development Kit allows access to block storage using tcmu-runner
handlers. (tcmu-runner itself is not involved; only its loadable handlers are
used here.)

The bdev_tcmur module is based on the bdev_aio module source. It enables the
pathways for LUN 2 and LUN 3 shown in the diagram.

Distributed Replicated Block Device (DRBD 9.0) running in usermode
==================================================================
A recent project ported DRBD from the kernel to run in usermode as a Linux
process, using support from emulated kernel functions and a multi-threaded
engine based on epoll_wait(). The DRBD source code itself is unmodified, with
its expected environment simulated around it. It receives requests from clients
through the kernel's block-I/O ("bio") protocol, and also makes requests to its
backing storage using that same protocol.

Usermode DRBD can be plumbed under Usermode SCST (not shown in this diagram), or
under a FUSE interface (drbd1 in the diagram).

DRBD running with SPDK
======================
To bring usermode DRBD into an SPDK process, a new SPDK bdev module "bdev_bio"
implements translation of SPDK block device requests into the kernel's block-I/O
("bio") protocol, as expected by DRBD. This enables the pathways for LUN 4 and
LUN 5 shown in the diagram.

DRBD then makes bio requests to its backing storage, which at present must be a
tcmu-runner device. Supporting arbitrary SPDK devices (e.g. using Malloc0 to
back a DRBD device) requires a "bio_bdev" module to translate bio requests into
the SPDK bdev protocol. (TBD)

The SPDK configuration file plus an external helper provide enough for SPDK to
configure DRBD with the devices needed by SPDK.
Once the SPDK+DRBD server is up and running, the DRBD logic can be controlled
using the native DRBD management commands (drbdsetup and drbdadm).

The emulated kernel functions (UMC - usermode compatibility) make use of
services provided by a multithreaded event engine (MTE) implemented around
epoll_wait(). The MTE services are accessed by UMC through an ops vector backed
by MTE services for memory, time, and threads, as well as event polling of file
descriptors, timers, and a FIFO of work to be done ASAP. I anticipate an easy
time converting the ops vector to point at a shim to SPDK services in place of
MTE calls.

Limitations
===========
The implementation is very new. So far I have mainly tested it using the SPDK
iSCSI server, exporting tcmu-runner backend devices as SCSI LUNs. That seems to
work reliably. The drbd and tcmur devices can alternatively be mounted locally
through the FUSE interface, which also works. I have only tried it with one
reactor core.

This prototype implementation is clearly in need of some cleaning up and of
having its interfaces straightened out. I've been studying SPDK for less than
two weeks, and I guessed at a few things that I need to go back over carefully.
But it runs.

The makefiles have optimizations turned off and debugs turned on.

The UMC FUSE implementation is single-threaded and synchronous; thus it operates
at an effective queue depth of one. This matters most when using it to access
replicated volumes with DRBD Protocol C, where performance will suffer
significantly. Accessing volumes with Protocol A configured to "pull-ahead"
performs reasonably, as does accessing the same data through an iSCSI LUN, which
does not have the QD=1 limitation.

NOTE: Only tcmu-runner modules handler_ram.so and handler_file.so have been
tried so far; the latter is significantly faster, so it is the one specified in
the example configuration files. An *async* tcmu-runner handler
(nr_threads == 0) has yet to be tried!

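
For illustration only, the ops-vector indirection between UMC and MTE described
above (just before "Limitations") might look roughly like the C sketch below.
Every name in it (struct sys_ops, libc_backend, shim_backend, sys_ops_select,
emulated_kmalloc, and so on) is hypothetical and does not come from the UMC or
MTE sources; plain libc calls stand in for the MTE services, and a counting
wrapper stands in for a future shim to SPDK services. The point it shows is
that callers only ever go through the vector, so changing backends is a
one-pointer change.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical ops vector: the real UMC/MTE member names are not shown in
 * this README, so these three members are assumptions for illustration. */
struct sys_ops {
    void *(*mem_alloc)(size_t size);
    void (*mem_free)(void *p);
    long long (*time_now_ns)(void);
};

/* Backend A: stands in for the MTE services (plain libc here). */
static void *libc_alloc(size_t size) { return malloc(size); }
static void libc_free(void *p) { free(p); }
static long long libc_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
static const struct sys_ops libc_backend = {
    libc_alloc, libc_free, libc_now_ns
};

/* Backend B: stands in for a shim to another provider (e.g. SPDK services).
 * It only counts allocations and forwards to libc; a real shim would call
 * the provider's own allocator and clock instead. */
static unsigned long shim_alloc_calls;
static void *shim_alloc(size_t size) { shim_alloc_calls++; return malloc(size); }
static const struct sys_ops shim_backend = {
    shim_alloc, libc_free, libc_now_ns
};

/* The single pointer through which all emulated kernel code reaches services. */
static const struct sys_ops *sys = &libc_backend;

/* Repoint the vector at a different backend. */
void sys_ops_select(const struct sys_ops *ops) { sys = ops; }

/* Emulated kernel-style helpers never name a backend directly. */
void *emulated_kmalloc(size_t size) { return sys->mem_alloc(size); }
void emulated_kfree(void *p) { sys->mem_free(p); }
long long emulated_now_ns(void) { return sys->time_now_ns(); }
```

After sys_ops_select(&shim_backend), every later emulated_kmalloc() call is
served by the shim without any change to the calling code.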
Usermode DRBD Limitations
=========================
Netlink multicast emulation is not yet implemented, so anything like
"drbdsetup wait*" hangs.

The bio block device nodes are exposed through a mount of the server's UMC fuse
filesystem implementation. The fuse-tree node that represents a DRBD or TCMUR
block device appears as a regular file rather than as a block device (because
otherwise fuse directs I/O for that dev_t to the kernel instead of the fuse
filesystem server). So when communicating with a usermode server, the DRBD
utilities are modified to omit the check that their device's file type is
S_IFBLK rather than S_IFREG.

Messages from the utilities and in the logs have not been modified, so will
still refer to "the kernel" etc when referring to code that has been ported from
the kernel to usermode.

Resync may run noticeably slower when observing resync network traffic with
tcpdump.

Something I expect NOT to work is running the server executable off of a disk it
implements.

I have only run the usermode server on machines without DRBD installed in the
kernel. The build script and the config/run instructions assume that there are
no DRBD modules or utilities installed. (That would likely be very confusing,
but might actually work if assigned separate ports.)

Bugs
====
In general only the "happy path" has received any exercise -- expect bugs in
untested error-handling logic.

"Exclusive" opens aren't really exclusive, so be careful not to mount the same
storage twice; for example /UMCfuse/dev/file_c and /UMCfuse/dev/drbd1 are the
same storage in the example configuration. For another example, SPDK
configuration [BIO] for bdev_bio should never consume both drbd2 and ram_b
concurrently. "Holders" and "claims" are not yet implemented.

The "writable" bits in the mode permissions do not appear correctly in
/UMCfuse/dev.

The server apparently can mount and write a replicated DRBD device on a
secondary node.

fsync/flush is probably ineffective.

4096 is the only tested block size; possible bugs with others.

Stacktrace is broken.

Probably there are broken untested refcountings on things that usually only get
opened once (e.g. two concurrent dd commands to the same device or things like
that).

Clean shutdown does not work at all.

I always "make clean" before make, because my makefiles don't calculate
dependencies right.

The makefiles are hateworthy.

SCST repository is unnecessarily tangled up with the build.

Sometimes DRBD resync doesn't start upon reconnect after restarting the server.
If it doesn't start, disconnecting + reconnecting to the peer usually gets it
going.

I have seen a very weird problem using the tcmu-runner handler_file.so. After
dlopen(), libtcmur.c looks up the symbol for the handler_init routine and calls
it. The handler calls back with the address of its ops vector. The function
addresses in the ops vector are properly relocated for the loaded module, and
the main module calls functions through the ops vector thousands of times... and
then suddenly SIGSEGV, and examining the ops vector (under gdb) the function
addresses are all back to their original UNRELOCATED relative values! (And the
faulting program counter address matches the unrelocated value in the member of
the ops vector it was trying to call through.) I have never seen this happen
with handler_ram.

However, I have not seen the problem since I ensured adequate memory for the
SPDK server. The SPDK test machine has "only" 4GiB RAM, and swap space used was
increasing during problem tests.

Because handler_file runs significantly faster than handler_ram for mounted
filesystems, all the tcmu-runner handler devices in the example are now by
default configured to use handler_file (despite some names in /UMCfuse/dev and
/tmp continuing to be called "ram" rather than "file").

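
One way to catch the ops-vector corruption described above closer to the moment
it happens would be to snapshot the vector when the handler registers and to
compare against the snapshot before each dispatch. The C sketch below is only
an assumption about how such a check could be written; it is not code from
libtcmur.c. The names struct handler_ops, register_handler, handler_ops_intact,
checked_read, and the in-process demo handler are all hypothetical, and the
demo handler stands in for a module that would really be loaded with dlopen().

```c
#include <string.h>

/* Hypothetical, reduced ops vector; the real tcmu-runner handler structure
 * has different members. */
struct handler_ops {
    int (*open)(const char *cfg);
    int (*read)(void *buf, unsigned long len, unsigned long long off);
    int (*write)(const void *buf, unsigned long len, unsigned long long off);
};

static const struct handler_ops *live_ops; /* points into the loaded module */
static struct handler_ops saved_ops;       /* private copy taken at registration */

/* Called back by the handler's init routine with the address of its ops vector. */
int register_handler(const struct handler_ops *ops)
{
    if (!ops || !ops->open || !ops->read || !ops->write)
        return -1;
    live_ops = ops;
    saved_ops = *ops;   /* snapshot while the addresses are known to be good */
    return 0;
}

/* Returns 1 if the module's vector still matches the snapshot, else 0. */
int handler_ops_intact(void)
{
    return live_ops && memcmp(live_ops, &saved_ops, sizeof(saved_ops)) == 0;
}

/* Dispatch a read only if the vector is unchanged; otherwise fail with -1
 * instead of jumping through a possibly bad function address. */
int checked_read(void *buf, unsigned long len, unsigned long long off)
{
    if (!handler_ops_intact())
        return -1;
    return live_ops->read(buf, len, off);
}

/* Demo handler living in the same file, standing in for a dlopen()ed module. */
static int demo_open(const char *cfg) { (void)cfg; return 0; }
static int demo_read(void *buf, unsigned long len, unsigned long long off)
{
    (void)off;
    memset(buf, 0, len);    /* pretend the device reads back zeroes */
    return 0;
}
static int demo_write(const void *buf, unsigned long len, unsigned long long off)
{
    (void)buf; (void)len; (void)off;
    return 0;
}
static struct handler_ops demo_ops = { demo_open, demo_read, demo_write };

/* Plays the role of the module's init routine: it calls back to register. */
int demo_handler_init(void) { return register_handler(&demo_ops); }
```

A failed check could log both copies of the vector, which would show whether
the module's data was overwritten or reverted, without waiting for the SIGSEGV.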
Building from Source Code
=========================
The source code to build SPDK with support for tcmu-runner handlers is in my
forks of the SPDK and tcmu-runner repositories. Building-in DRBD support
requires several additional repositories.

Because building is presently a mess, I've included scripts that will download
the repositories and build SPDK with support for tcmu-runner loadable handlers
and/or DRBD.

To download and build the SPDK iSCSI server with support for BOTH, cd into an
empty directory and do:

    wget https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/BUILD_spdk_drbd.sh
    chmod 755 BUILD_spdk_drbd.sh
    ./BUILD_spdk_drbd.sh

To OMIT DRBD and only download/build SPDK with support for tcmu-runner handlers
do this instead:

    wget https://raw.githubusercontent.com/DavidButterfield/spdk/tcmu-runner/BUILD_spdk_tcmur.sh
    chmod 755 BUILD_spdk_tcmur.sh
    ./BUILD_spdk_tcmur.sh

The (former) DRBD script downloads and builds a superset of what the (latter)
TCMUR script does, and after the DRBD download you can specify to build the more
limited server (to support TCMUR but not DRBD) by selection of configuration
options:

    --with-tcmur                # SPDK with tcmu-runner only
    --with-tcmur --with-drbd    # SPDK with DRBD and tcmu-runner

Comments in the download/build scripts document the process in case you want to
do some steps manually. (It asks for the sudo password to install, so you might
want to look at it first.)

The SCRIPTS ASSUME you already have the tools and libraries installed such that
you can build the standard SPDK, DRBD, and tcmu-runner repositories.
Some of the makefiles require various build tools -- here are package names I
added to a fresh installation of Ubuntu 18.04 LTS to complete the build:

    build-essential g++ gcc git make gdb valgrind cscope exuberant-ctags
    libfuse-dev libaio-dev libglib2.0-dev libkmod-dev libnl-3-dev
    libnl-genl-3-dev librbd-dev autoconf automake flex coccinelle cmake

I always "make clean" before "make", because my makefiles don't calculate
dependencies right.

There should be no compile errors, but there will be some warnings in the DRBD
code. The build script documents a few that are expected and can be ignored for
now.

Configuring
===========
The example config files in etc/drbd.d are from a node in my setup. They will
have to be modified to suit your network configuration, and put into /etc/drbd.d
on your test system.

There is also a nasty "helper" script /usr/sbin/drbdadm_up_primary which at
present can only bring up one specific SPDK/DRBD device in the example
configuration. To support a different configuration, that file probably needs
updating (in addition to /etc/drbd.d/* and the SPDK configuration file).

Running
=======
To run the DRBD management utilities so that they refer to the simulated /proc
that talks to the usermode server process (rather than the real /proc that talks
to the kernel):

    export UMC_FS_ROOT=/UMCfuse     # *** SET ENVIRONMENT ***

The utilities need the $UMC_FS_ROOT environment variable set to control the
usermode DRBD server instead of a kernel-based server. But they also need to
run as superuser. Keep in mind that the sudo program does not pass your shell
environment through to the program given on its command line, unless you specify
"sudo -E".
(Omitting the "-E" leads to bewildering non-sequitur error messages because the
utility is trying to parse an earlier version of the command language.)

Also the *server* needs the $UMC_FS_ROOT environment variable set, because it
invokes the utilities through a "usermode helper", and they inherit the variable
from the server.

The download/build script ends with a suggested server command line that depends
on which script you used. The two scripts refer to different configuration
files depending on whether DRBD support was selected or not.

Troubleshooting
===============
If you didn't read the sections "Configuring" and "Running" just above, read
those.

The implementation and configuration of SPDK+DRBD is an order of magnitude more
complex than the relatively straightforward implementation of tcmu-runner
handlers under SPDK. You may wish to make sure the simpler case works before
bringing in DRBD.

Make sure your configuration files were suitably modified for your names,
addresses, etc.

Make sure you are running the server and the utilities with the environment
variable set:

    export UMC_FS_ROOT=/UMCfuse
    sudo -E drbdadm ...     # -E to pass the environment variable through sudo

Missing the environment variable leads to bewildering non-sequitur error
messages because the utility is trying to parse an earlier version of the
command language. These messages in the server log or output from a DRBD
utility probably mean the environment variable is not set:

    Cannot determine minor device number of device
    Missing connection endpoint argument
    Parse error: 'disk | device | address | meta-disk | flexible-meta-disk' expected, but got 'node-id'

/proc and /sys/module entries for the DRBD usermode server can be observed under
/UMCfuse.

After starting the server, a node should appear in /UMCfuse/dev for each bio or
tcmu-runner device configured by SPDK.

DRBD resource "nonspdk" (drbd1) is not configured as an SPDK device.
After the server is up the resource may be enabled using the native DRBD
command, after which its node should appear under /UMCfuse/dev:

    drbdadm up nonspdk      # assumes metadata previously created

Multiple names can refer to the same underlying storage. Referring to the
diagram, LUN 5, bio1, /UMCfuse/dev/drbd2, and /UMCfuse/dev/ram_b all refer to
the same underlying storage in /tmp/tcmur_ram01. A filesystem can be mounted on
an iSCSI initiator as LUN 5, or the same filesystem can be mounted locally, e.g.

    sudo mount /UMCfuse/dev/drbd2 /mnt/x

One bug is that exclusive open is not currently exclusive, so be careful not to
use storage multiple ways at the same time!

More Information
================
The DRBD kernel source code ported to usermode is (within a dozen lines of)
unmodified from the original code in the LINBIT repository, with its expected
kernel environment simulated around it. For more information about how that was
done, see the README.md with diagrams at
https://github.com/DavidButterfield/SCST-Usermode-Adaptation

David Butterfield
Tue 17 Sep 2019 09:43:35 PM MDT