On Thu, 2021-05-06 at 04:42 -0400, Olivier Langlois wrote: > On Wed, 2021-05-05 at 23:17 -0400, Olivier Langlois wrote: > > Note that the poll remove sqe and the following poll add sqe don't > > have > > exactly the same user_data. > > > > I have this statement in between: > > /* increment generation counter to avoid handling old events */ > > ++anfds [fd].egen; > > > > poll remove cancel the previous poll add having gen 1 in its user > > data. > > the next poll add has it user_data storing gen var set to 2: > > > > 1 3 remove 85 1 > > 1 3 add 85 2 > > > > 85 gen 1 res -125 > > 85 gen 1 res 4 > > > Good news! > > I have used the io_uring tracepoints and they confirm that io_uring > works as expected: > > For the above printfs, I get the following perf traces: > > 11940.259 Execution SVC/134675 io_uring:io_uring_submit_sqe(ctx: > 0xffff9d3c4368c000, opcode: 7, force_nonblock: 1) > 11940.270 Execution SVC/134675 io_uring:io_uring_complete(ctx: > 0xffff9d3c4368c000, user_data: 4294967382, res: -125) > 11940.272 Execution SVC/134675 io_uring:io_uring_complete(ctx: > 0xffff9d3c4368c000) > 11940.275 Execution SVC/134675 io_uring:io_uring_file_get(ctx: > 0xffff9d3c4368c000, fd: 86) > 11940.277 Execution SVC/134675 io_uring:io_uring_submit_sqe(ctx: > 0xffff9d3c4368c000, opcode: 6, user_data: 4294967382, force_nonblock: > 1) > 11940.279 Execution SVC/134675 io_uring:io_uring_complete(ctx: > 0xffff9d3c4368c000, user_data: 4294967382, res: 4) > > So, it seems the compiler is playing games on me. It prints the correct > gen 2 value but is passing 1 to io_uring_sqe_set_data()... > > I'll try to turn optimization off to see if it helps. > > thx a lot again for your exceptional work! > > The more I fool around trying to find the problem, the more I think that my problem is somewhere in the liburing (v2.0) code because of a possibly missing memory barrier. The function that I do use to submit the sqes is io_uring_wait_cqe_timeout(). My problem did appear when I did replace libev original boilerplate code for liburing (v2.0) used for filling and submitting the sqes. Do you remember when you pointed out that I wasn't setting the user_data field for my poll remove request in: io_uring_prep_poll_remove(sqe, iouring_build_user_data(IOURING_POLL, fd, anfds [fd].egen)); // io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_POLL, fd, anfds [fd].egen)); ? The last call to remove the non-existant 'add 85 2' is replied by ENOENT by io_uring and this was caught by an assert in my case IOURING_POLL cqe handler. iouring_decode_user_data(cqe->user_data, &type, &fd, &gen); switch (type) { This is impossible to end up there because if you do not explicitly set user_data, io_uring_prep_rw() is setting it to 0. In order for my assert to be hit, user_data would have to be set with the commented out io_uring_sqe_set_data(), and it happens to also be the value of the previously sent sqe user_data... I did replace the code above with: io_uring_prep_poll_remove(sqe, iouring_build_user_data(IOURING_POLL_ADD, fd, anfds [fd].egen)); io_uring_sqe_set_data(sqe, iouring_build_user_data(IOURING_POLL_REMOVE, fd, anfds [fd].egen)); and my assert for cqe->res != -ENOENT has stopped being hit. There is clearly something messing with the sqe user_data communication between user and kernel and I start to suspect that this might be somewhere inside io_uring_wait_cqe_timeout()...