Re: [PATCH] __io_uring_get_cqe: eliminate unnecessary io_uring_enter() syscalls

Jens Axboe <axboe@xxxxxxxxx> · Wed, 4 Mar 2020 06:57:05 -0700

On 3/4/20 6:27 AM, Xiaoguang Wang wrote:
> hi，
> 
>> On 3/2/20 8:24 AM, Jens Axboe wrote:
>>> On 3/2/20 7:05 AM, Jens Axboe wrote:
>>>> On 3/1/20 9:18 PM, Xiaoguang Wang wrote:
>>>>> When user applis programming mode, like sumbit one sqe and wait its
>>>>> completion event, __io_uring_get_cqe() will result in many unnecessary
>>>>> syscalls, see below test program:
>>>>>
>>>>>      int main(int argc, char *argv[])
>>>>>      {
>>>>>              struct io_uring ring;
>>>>>              int fd, ret;
>>>>>              struct io_uring_sqe *sqe;
>>>>>              struct io_uring_cqe *cqe;
>>>>>              struct iovec iov;
>>>>>              off_t offset, filesize = 0;
>>>>>              void *buf;
>>>>>
>>>>>              if (argc < 2) {
>>>>>                      printf("%s: file\n", argv[0]);
>>>>>                      return 1;
>>>>>              }
>>>>>
>>>>>              ret = io_uring_queue_init(4, &ring, 0);
>>>>>              if (ret < 0) {
>>>>>                      fprintf(stderr, "queue_init: %s\n", strerror(-ret));
>>>>>                      return 1;
>>>>>              }
>>>>>
>>>>>              fd = open(argv[1], O_RDONLY | O_DIRECT);
>>>>>              if (fd < 0) {
>>>>>                      perror("open");
>>>>>                      return 1;
>>>>>              }
>>>>>
>>>>>              if (posix_memalign(&buf, 4096, 4096))
>>>>>                      return 1;
>>>>>              iov.iov_base = buf;
>>>>>              iov.iov_len = 4096;
>>>>>
>>>>>              offset = 0;
>>>>>              do {
>>>>>                      sqe = io_uring_get_sqe(&ring);
>>>>>                      if (!sqe) {
>>>>>                              printf("here\n");
>>>>>                              break;
>>>>>                      }
>>>>>                      io_uring_prep_readv(sqe, fd, &iov, 1, offset);
>>>>>
>>>>>                      ret = io_uring_submit(&ring);
>>>>>                      if (ret < 0) {
>>>>>                              fprintf(stderr, "io_uring_submit: %s\n", strerror(-ret));
>>>>>                              return 1;
>>>>>                      }
>>>>>
>>>>>                      ret = io_uring_wait_cqe(&ring, &cqe);
>>>>>                      if (ret < 0) {
>>>>>                              fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-ret));
>>>>>                              return 1;
>>>>>                      }
>>>>>
>>>>>                      if (cqe->res <= 0) {
>>>>>                              if (cqe->res < 0) {
>>>>>                                      fprintf(stderr, "got eror: %d\n", cqe->res);
>>>>>                                      ret = 1;
>>>>>                              }
>>>>>                              io_uring_cqe_seen(&ring, cqe);
>>>>>                              break;
>>>>>                      }
>>>>>                      offset += cqe->res;
>>>>>                      filesize += cqe->res;
>>>>>                      io_uring_cqe_seen(&ring, cqe);
>>>>>              } while (1);
>>>>>
>>>>>              printf("filesize: %ld\n", filesize);
>>>>>              close(fd);
>>>>>              io_uring_queue_exit(&ring);
>>>>>              return 0;
>>>>>      }
>>>>>
>>>>> dd if=/dev/zero of=testfile bs=4096 count=16
>>>>> ./test  testfile
>>>>> and use bpftrace to trace io_uring_enter syscalls, in original codes,
>>>>> [lege@localhost ~]$ sudo bpftrace -e "tracepoint:syscalls:sys_enter_io_uring_enter {@c[tid] = count();}"
>>>>> Attaching 1 probe...
>>>>> @c[11184]: 49
>>>>> Above test issues 49 syscalls, it's counterintuitive. After looking
>>>>> into the codes, it's because __io_uring_get_cqe issue one more syscall,
>>>>> indded when __io_uring_get_cqe issues the first syscall, one cqe should
>>>>> already be ready, we don't need to wait again.
>>>>>
>>>>> To fix this issue, after the first syscall, set wait_nr to be zero, with
>>>>> tihs patch, bpftrace shows the number of io_uring_enter syscall is 33.
>>>>
>>>> Thanks, that's a nice fix, we definitely don't want to be doing
>>>> 50% more system calls than we have to...
>>>
>>> Actually, don't think the fix is quite safe. For one, if we get an error
>>> on the __io_uring_enter(), then we may not have waited for entries. Or if
>>> we submitted less than we thought we would, we would not have waited
>>> either. So we need to check for full success before deeming it safe to
>>> clear wait_nr.
>>
>> Unrelated fix:
>>
>> https://git.kernel.dk/cgit/liburing/commit/?id=0edcef5700fd558d2548532e0e5db26cb74d19ca
>>
>> and then a fix for your patch on top:
>>
>> https://git.kernel.dk/cgit/liburing/commit/?id=dc14e30a086082b6aebc3130948e2453e3bd3b2a
> In this patch, seesms that you forgot to delete:
>      if (wait_nr)
>          wait_nr = 0;
> 
> With these two codes removed, my original test case still produces the same amount
> of io_uring_enter syscalls, so you can just remove them safely.

Ah indeed, thanks for letting me know! Killed those two lines.

-- 
Jens Axboe