On 3/2/20 8:24 AM, Jens Axboe wrote: > On 3/2/20 7:05 AM, Jens Axboe wrote: >> On 3/1/20 9:18 PM, Xiaoguang Wang wrote: >>> When user applis programming mode, like sumbit one sqe and wait its >>> completion event, __io_uring_get_cqe() will result in many unnecessary >>> syscalls, see below test program: >>> >>> int main(int argc, char *argv[]) >>> { >>> struct io_uring ring; >>> int fd, ret; >>> struct io_uring_sqe *sqe; >>> struct io_uring_cqe *cqe; >>> struct iovec iov; >>> off_t offset, filesize = 0; >>> void *buf; >>> >>> if (argc < 2) { >>> printf("%s: file\n", argv[0]); >>> return 1; >>> } >>> >>> ret = io_uring_queue_init(4, &ring, 0); >>> if (ret < 0) { >>> fprintf(stderr, "queue_init: %s\n", strerror(-ret)); >>> return 1; >>> } >>> >>> fd = open(argv[1], O_RDONLY | O_DIRECT); >>> if (fd < 0) { >>> perror("open"); >>> return 1; >>> } >>> >>> if (posix_memalign(&buf, 4096, 4096)) >>> return 1; >>> iov.iov_base = buf; >>> iov.iov_len = 4096; >>> >>> offset = 0; >>> do { >>> sqe = io_uring_get_sqe(&ring); >>> if (!sqe) { >>> printf("here\n"); >>> break; >>> } >>> io_uring_prep_readv(sqe, fd, &iov, 1, offset); >>> >>> ret = io_uring_submit(&ring); >>> if (ret < 0) { >>> fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret)); >>> return 1; >>> } >>> >>> ret = io_uring_wait_cqe(&ring, &cqe); >>> if (ret < 0) { >>> fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret)); >>> return 1; >>> } >>> >>> if (cqe->res <= 0) { >>> if (cqe->res < 0) { >>> fprintf(stderr, "got eror: %d\n", cqe->res); >>> ret = 1; >>> } >>> io_uring_cqe_seen(&ring, cqe); >>> break; >>> } >>> offset += cqe->res; >>> filesize += cqe->res; >>> io_uring_cqe_seen(&ring, cqe); >>> } while (1); >>> >>> printf("filesize: %ld\n", filesize); >>> close(fd); >>> io_uring_queue_exit(&ring); >>> return 0; >>> } >>> >>> dd if=/dev/zero of=testfile bs=4096 count=16 >>> ./test testfile >>> and use bpftrace to trace io_uring_enter syscalls, in original codes, >>> [lege@localhost ~]$ sudo bpftrace -e "tracepoint:syscalls:sys_enter_io_uring_enter {@c[tid] = count();}" >>> Attaching 1 probe... >>> @c[11184]: 49 >>> Above test issues 49 syscalls, it's counterintuitive. After looking >>> into the codes, it's because __io_uring_get_cqe issue one more syscall, >>> indded when __io_uring_get_cqe issues the first syscall, one cqe should >>> already be ready, we don't need to wait again. >>> >>> To fix this issue, after the first syscall, set wait_nr to be zero, with >>> tihs patch, bpftrace shows the number of io_uring_enter syscall is 33. >> >> Thanks, that's a nice fix, we definitely don't want to be doing >> 50% more system calls than we have to... > > Actually, don't think the fix is quite safe. For one, if we get an error > on the __io_uring_enter(), then we may not have waited for entries. Or if > we submitted less than we thought we would, we would not have waited > either. So we need to check for full success before deeming it safe to > clear wait_nr. Unrelated fix: https://git.kernel.dk/cgit/liburing/commit/?id=0edcef5700fd558d2548532e0e5db26cb74d19ca and then a fix for your patch on top: https://git.kernel.dk/cgit/liburing/commit/?id=dc14e30a086082b6aebc3130948e2453e3bd3b2a Can you double check that your original test case still produces the same amount of system calls with the fix in place? -- Jens Axboe