On 6/16/20 12:45 PM, Jens Axboe wrote: > On 6/16/20 7:58 AM, Jens Axboe wrote: >> On 6/15/20 9:04 PM, Jiufei Xue wrote: >>> >>> >>> On 2020/6/15 下午11:09, Jens Axboe wrote: >>>> On 6/14/20 8:49 PM, Jiufei Xue wrote: >>>>> Hi Jens, >>>>> >>>>> On 2020/6/13 上午12:48, Jens Axboe wrote: >>>>>> On 6/12/20 8:58 AM, Jens Axboe wrote: >>>>>>> On 6/11/20 8:30 PM, Jiufei Xue wrote: >>>>>>>> poll events should be 32-bits to cover EPOLLEXCLUSIVE. >>>>>>>> >>>>>>>> Signed-off-by: Jiufei Xue <jiufei.xue@xxxxxxxxxxxxxxxxx> >>>>>>>> --- >>>>>>>> fs/io_uring.c | 4 ++-- >>>>>>>> include/uapi/linux/io_uring.h | 2 +- >>>>>>>> tools/io_uring/liburing.h | 2 +- >>>>>>>> 3 files changed, 4 insertions(+), 4 deletions(-) >>>>>>>> >>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c >>>>>>>> index 47790a2..6250227 100644 >>>>>>>> --- a/fs/io_uring.c >>>>>>>> +++ b/fs/io_uring.c >>>>>>>> @@ -4602,7 +4602,7 @@ static void io_poll_queue_proc(struct file *file, struct wait_queue_head *head, >>>>>>>> static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) >>>>>>>> { >>>>>>>> struct io_poll_iocb *poll = &req->poll; >>>>>>>> - u16 events; >>>>>>>> + u32 events; >>>>>>>> >>>>>>>> if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL)) >>>>>>>> return -EINVAL; >>>>>>>> @@ -8196,7 +8196,7 @@ static int __init io_uring_init(void) >>>>>>>> BUILD_BUG_SQE_ELEM(28, /* compat */ int, rw_flags); >>>>>>>> BUILD_BUG_SQE_ELEM(28, /* compat */ __u32, rw_flags); >>>>>>>> BUILD_BUG_SQE_ELEM(28, __u32, fsync_flags); >>>>>>>> - BUILD_BUG_SQE_ELEM(28, __u16, poll_events); >>>>>>>> + BUILD_BUG_SQE_ELEM(28, __u32, poll_events); >>>>>>>> BUILD_BUG_SQE_ELEM(28, __u32, sync_range_flags); >>>>>>>> BUILD_BUG_SQE_ELEM(28, __u32, msg_flags); >>>>>>>> BUILD_BUG_SQE_ELEM(28, __u32, timeout_flags); >>>>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h >>>>>>>> index 92c2269..afc7edd 100644 >>>>>>>> --- a/include/uapi/linux/io_uring.h >>>>>>>> +++ b/include/uapi/linux/io_uring.h >>>>>>>> @@ -31,7 +31,7 @@ struct io_uring_sqe { >>>>>>>> union { >>>>>>>> __kernel_rwf_t rw_flags; >>>>>>>> __u32 fsync_flags; >>>>>>>> - __u16 poll_events; >>>>>>>> + __u32 poll_events; >>>>>>>> __u32 sync_range_flags; >>>>>>>> __u32 msg_flags; >>>>>>>> __u32 timeout_flags; >>>>>>> >>>>>>> We obviously have the space in there as most other flag members are 32-bits, but >>>>>>> I'd want to double check if we're not changing the ABI here. Is this always >>>>>>> going to be safe, on any platform, regardless of endianess etc? >>>>>> >>>>>> Double checked, and as I feared, we can't safely do this. We'll have to >>>>>> do something like the below, grabbing an unused bit of the poll mask >>>>>> space and if that's set, then store the fact that EPOLLEXCLUSIVE is set. >>>>>> So probably best to turn this just into one patch, since it doesn't make >>>>>> a lot of sense to do it as a prep patch at that point. >>>>>> >>>>> Yes, Agree about that. But I also fear that if the unused bit is used >>>>> in the feature, it will bring unexpected behavior. >>>> >>>> Yeah, it's certainly not the prettiest and could potentially be fragile. >>>> I'm open to suggestions, we need some way of signaling that the 32-bit >>>> variant of the poll_events should be used. We could potentially make >>>> this work by doing explicit layout for big endian vs little endian, that >>>> might be prettier and wouldn't suffer from the "grab some random bit" >>>> issue. >>>> >>> Thank you for your suggestion, I will think about it. >>> >>>>>> This does have the benefit of not growing io_poll_iocb. With your patch, >>>>>> it'd go beyond a cacheline, and hence bump the size of the entire >>>>>> io_iocb as well, which would be very unfortunate. >>>>>> >>>>> events in io_poll_iocb is 32-bits already, so why it will bump the >>>>> size of the io_iocb structure with my patch? >>>> >>>> It's not 32-bits already, it's a __poll_t type which is 16-bits only. >>>> >>> Yes, it is a __poll_t type, but I found that __poll_t type is 32-bits >>> with the definition below: >>> >>> typedef unsigned __bitwise __poll_t; >>> >>> And I also investigate it with crash: >>> crash> io_poll_iocb -ox >>> struct io_poll_iocb { >>> [0x0] struct file *file; >>> union { >>> [0x8] struct wait_queue_head *head; >>> [0x8] u64 addr; >>> }; >>> [0x10] __poll_t events; >>> [0x14] bool done; >>> [0x15] bool canceled; >>> [0x18] struct wait_queue_entry wait; >>> } >> >> Yeah you're right, not sure why I figured it was 16-bits. But just >> checking on my default build: >> >> axboe@x1 ~/gi/linux-block (block-5.8)> pahole -C io_poll_iocb fs/io_uring.o >> struct io_poll_iocb { >> struct file * file; /* 0 8 */ >> union { >> struct wait_queue_head * head; /* 8 8 */ >> u64 addr; /* 8 8 */ >> }; /* 8 8 */ >> __poll_t events; /* 16 4 */ >> bool done; /* 20 1 */ >> bool canceled; /* 21 1 */ >> >> /* XXX 2 bytes hole, try to pack */ >> >> struct wait_queue_entry wait; /* 24 40 */ >> >> /* size: 64, cachelines: 1, members: 6 */ >> /* sum members: 62, holes: 1, sum holes: 2 */ >> }; >> >> and it's definitely 64-bytes in total size (as it should be), and the >> 'events' is indeed 32-bits as well. >> >> So at least that's good, we don't need anything extra in there. If we >> can solve the endian issue, then it should be trivial to use the full >> 32-bits for the flags in the sqe. > > To get back to something that can move us forward, something like the > below I _think_ will work. Then applications just use poll32_events and > we don't have to handle this in any special way. Care to base on that > and re-send the change? > > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 92c22699a5a7..5b3f6bd59437 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -31,7 +31,16 @@ struct io_uring_sqe { > union { > __kernel_rwf_t rw_flags; > __u32 fsync_flags; > - __u16 poll_events; > + struct { > +#ifdef __BIG_ENDIAN_BITFIELD > + __u16 __unused_poll; > + __u16 poll_events; > +#else > + __u16 poll_events; > + __u16 __unused_poll; > +#endif > + }; > + __u32 poll32_events; > __u32 sync_range_flags; > __u32 msg_flags; > __u32 timeout_flags; > Nah this won't work, as the BE layout will be different. So how about this, just add a 16-bit poll_events_hi instead. App/liburing will set upper bits there. Something like the below, then just needs the exclusive wait change on top. Only downside I can see is that newer applications on older kernels will set EPOLLEXCLUSIVE but the kernel will ignore it. That's not a huge concern for this particular bit, but could be a concern if there are others that prove useful. diff --git a/fs/io_uring.c b/fs/io_uring.c index de1175206807..a9d74330ad6b 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -4809,6 +4809,9 @@ static int io_poll_add_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe events = READ_ONCE(sqe->poll_events); poll->events = demangle_poll(events) | EPOLLERR | EPOLLHUP; + if (READ_ONCE(sqe->poll_events_hi) & EPOLLEXCLUSIVE) + poll->events |= EPOLLEXCLUSIVE; + io_get_req_task(req); return 0; } diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 92c22699a5a7..e6856d8e068f 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -31,7 +31,10 @@ struct io_uring_sqe { union { __kernel_rwf_t rw_flags; __u32 fsync_flags; - __u16 poll_events; + struct { + __u16 poll_events; + __u16 poll_events_hi; + }; __u32 sync_range_flags; __u32 msg_flags; __u32 timeout_flags; -- Jens Axboe