This adds a new io_uring interface to specify meta along with read/write.
Beyond reading/writing meta, the interface also enables (a) flags to
control data-integrity checks and (b) an application tag.
The block path (direct IO) and the NVMe driver are modified to support
this.

The first 5 patches are enhancements/fixes in block/nvme so that the user
meta buffer is handled correctly, mostly for the case where it gets split.
Patch 8 adds the io_uring support.
Patch 9 adds the support for block direct IO, and patch 10 for NVMe.

Interface:
Two new opcodes in io_uring: IORING_OP_READ_META and IORING_OP_WRITE_META.
The leftover space in the SQE is used to send the meta buffer, its length,
the apptag, and the meta flags (guard/reftag/apptag checks for now).
An example program showing how to use the interface is appended below [1].

An alternative design would be to not introduce the new opcodes and to
add a new RWF_META flag instead. Open to that in the next version.
As for the new meta flags, RWF_* seemed a bit precious to use, hence the
route of carving those within the SQE itself.

Performance:
Non-meta IO is not affected by these patches.

Testing:
Done by modifying fio to use this interface.
https://github.com/SamsungDS/fio/commits/feat/test-meta-v2

Changes since RFC:
- modify io_uring plumbing based on recent async handling state changes
- fixes/enhancements to correctly handle the split for meta buffer
- add flags to specify guard/reftag/apptag checks
- add support to send apptag

[1]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <endian.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <linux/io_uring.h>
#include <linux/types.h>
#include "liburing.h"

/* write data/meta. read both. compare. send apptag too.
 * prerequisite:
 * unprotected xfer: format namespace with 4KB + 8b, pi_type = 0
 * protected xfer: format namespace with 4KB + 8b, pi_type = 1
 *
 * needs the io_uring uapi header from this series for the *_META
 * opcodes and the meta fields of the SQE.
 */

#define DATA_LEN 4096
#define META_LEN 8

struct t10_pi_tuple {
	__be16 guard;
	__be16 apptag;
	__be32 reftag;
};

int main(int argc, char *argv[])
{
	struct io_uring ring;
	struct io_uring_sqe *sqe = NULL;
	struct io_uring_cqe *cqe = NULL;
	void *wdb, *rdb;
	char wmb[META_LEN], rmb[META_LEN];
	char *data_str = "data buffer";
	char *meta_str = "meta";
	int fd, ret, blksize;
	struct stat st;
	unsigned long long offset = DATA_LEN;
	struct t10_pi_tuple *pi;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s <block-device>\n", argv[0]);
		return 1;
	}

	if (stat(argv[1], &st) == 0) {
		blksize = (int)st.st_blksize;
	} else {
		perror("stat");
		return 1;
	}

	/* posix_memalign() returns the error and does not set errno */
	if (posix_memalign(&wdb, blksize, DATA_LEN)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	if (posix_memalign(&rdb, blksize, DATA_LEN)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}

	/* zero the buffers so the full-length compares below are well defined */
	memset(wdb, 0, DATA_LEN);
	memset(rdb, 0, DATA_LEN);
	memset(wmb, 0, META_LEN);
	memset(rmb, 0, META_LEN);

	strcpy(wdb, data_str);
	strcpy(wmb, meta_str);

	fd = open(argv[1], O_RDWR | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "ring setup failed: %d\n", ret);
		return 1;
	}

	/* write data + meta-buffer to device */
	sqe = io_uring_get_sqe(&ring);
	if (!sqe) {
		fprintf(stderr, "get sqe failed\n");
		return 1;
	}

	io_uring_prep_write(sqe, fd, wdb, DATA_LEN, offset);
	sqe->opcode = IORING_OP_WRITE_META;
	sqe->meta_addr = (__u64)(uintptr_t)wmb;
	sqe->meta_len = META_LEN;
	/* flags to ask for guard/reftag/apptag */
	sqe->meta_flags = META_CHK_APPTAG;
	sqe->apptag = 0x1234;

	/* the apptag in the PI tuple is big-endian */
	pi = (struct t10_pi_tuple *)wmb;
	pi->apptag = htobe16(0x1234);

	ret = io_uring_submit(&ring);
	if (ret <= 0) {
		fprintf(stderr, "sqe submit failed: %d\n", ret);
		return 1;
	}

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0 || !cqe) {
		fprintf(stderr, "wait cqe failed: %d\n", ret);
		return 1;
	}
	if (cqe->res < 0) {
		fprintf(stderr, "write cqe failure: %d\n", cqe->res);
		return 1;
	}

	io_uring_cqe_seen(&ring, cqe);

	/* read data + meta-buffer back from device */
	sqe = io_uring_get_sqe(&ring);
	if (!sqe) {
		fprintf(stderr, "get sqe failed\n");
		return 1;
	}

	io_uring_prep_read(sqe, fd, rdb, DATA_LEN, offset);
	sqe->opcode = IORING_OP_READ_META;
	sqe->meta_addr = (__u64)(uintptr_t)rmb;
	sqe->meta_len = META_LEN;
	sqe->meta_flags = META_CHK_APPTAG;
	sqe->apptag = 0x1234;

	ret = io_uring_submit(&ring);
	if (ret <= 0) {
		fprintf(stderr, "sqe submit failed: %d\n", ret);
		return 1;
	}

	ret = io_uring_wait_cqe(&ring, &cqe);
	if (ret < 0 || !cqe) {
		fprintf(stderr, "wait cqe failed: %d\n", ret);
		return 1;
	}
	if (cqe->res < 0) {
		fprintf(stderr, "read cqe failure: %d\n", cqe->res);
		return 1;
	}
	io_uring_cqe_seen(&ring, cqe);

	/* buffers hold binary data (apptag bytes), so compare full length */
	if (memcmp(wmb, rmb, META_LEN))
		printf("Failure: meta mismatch!\n");
	if (memcmp(wdb, rdb, DATA_LEN))
		printf("Failure: data mismatch!\n");

	io_uring_queue_exit(&ring);
	close(fd);
	free(rdb);
	free(wdb);
	return 0;
}

Anuj Gupta (6):
  block: set bip_vcnt correctly
  block: copy bip_max_vcnt vecs instead of bip_vcnt during clone
  block: copy result back to user meta buffer correctly in case of split
  block: avoid unpinning/freeing the bio_vec incase of cloned bio
  block: modify bio_integrity_map_user argument
  io_uring/rw: add support to send meta along with read/write

Kanchan Joshi (4):
  block, nvme: modify rq_integrity_vec function
  block: define meta io descriptor
  block: add support to send meta buffer
  nvme: add separate handling for user integrity buffer

 block/bio-integrity.c         | 69 +++++++++++++++++++++++--------
 block/fops.c                  |  9 +++++
 block/t10-pi.c                |  6 +++
 drivers/nvme/host/core.c      | 36 ++++++++++++++++-
 drivers/nvme/host/ioctl.c     | 11 ++++-
 drivers/nvme/host/pci.c       |  9 +++--
 include/linux/bio.h           | 23 +++++++++--
 include/linux/blk-integrity.h | 13 +++---
 include/linux/fs.h            |  1 +
 include/uapi/linux/io_uring.h | 15 +++++++
 io_uring/io_uring.c           |  4 ++
 io_uring/opdef.c              | 30 ++++++++++++++
 io_uring/rw.c                 | 76 +++++++++++++++++++++++++++++++++--
 io_uring/rw.h                 | 11 ++++-
 14 files changed, 276 insertions(+), 37 deletions(-)

base-commit: 24c3fc5c75c5b9d471783b4a4958748243828613
-- 
2.25.1
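
Note on the apptag in the example [1]: the SQE carries the apptag in host
byte order (0x1234), while the apptag field of the PI tuple in the meta
buffer is big-endian (__be16). The sketch below shows one way to fill and
read that field independent of host endianness. It assumes the 8-byte
tuple layout used in the example; struct pi_tuple, pi_set_apptag() and
pi_get_apptag() are hypothetical names for illustration and are not part
of this series.

```c
#define _DEFAULT_SOURCE
#include <endian.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: same as struct t10_pi_tuple in the example.
 * All three fields are stored big-endian in the meta buffer.
 */
struct pi_tuple {
	uint16_t guard;
	uint16_t apptag;
	uint32_t reftag;
};

/* Hypothetical helper: store a host-order apptag into an 8-byte meta
 * buffer. memcpy() is used so the buffer needs no particular alignment.
 */
static void pi_set_apptag(void *meta, uint16_t apptag)
{
	struct pi_tuple pi;

	memcpy(&pi, meta, sizeof(pi));
	pi.apptag = htobe16(apptag);
	memcpy(meta, &pi, sizeof(pi));
}

/* Hypothetical helper: read the apptag back in host order. */
static uint16_t pi_get_apptag(const void *meta)
{
	struct pi_tuple pi;

	memcpy(&pi, meta, sizeof(pi));
	return be16toh(pi.apptag);
}
```

With such a helper, the write path of the example would call
pi_set_apptag(wmb, 0x1234) and pass the same 0x1234 in sqe->apptag.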