On 03/06/2017 09:59 AM, Avi Kivity wrote:
>
>
> On 03/06/2017 06:08 PM, Jens Axboe wrote:
>> On 03/06/2017 08:59 AM, Avi Kivity wrote:
>>> On 03/06/2017 05:38 PM, Jens Axboe wrote:
>>>> On 03/06/2017 08:29 AM, Avi Kivity wrote:
>>>>> On 03/06/2017 05:19 PM, Jens Axboe wrote:
>>>>>> On 03/06/2017 01:25 AM, Jan Kara wrote:
>>>>>>> On Sun 05-03-17 16:56:21, Avi Kivity wrote:
>>>>>>>>> The goal of the patch series is to return -EAGAIN/-EWOULDBLOCK if
>>>>>>>>> any of these conditions are met. This way userspace can push most
>>>>>>>>> of the write()s to the kernel to the best of its ability to complete
>>>>>>>>> and if it returns -EAGAIN, can defer it to another thread.
>>>>>>>>>
>>>>>>>> Is it not possible to push the iocb to a workqueue? This will allow
>>>>>>>> existing userspace to work with the new functionality, unchanged. Any
>>>>>>>> userspace implementation would have to do the same thing, so it's not like
>>>>>>>> we're saving anything by pushing it there.
>>>>>>> That is not easy because until IO is fully submitted, you need some parts
>>>>>>> of the context of the process which submits the IO (e.g. memory mappings,
>>>>>>> but possibly also other credentials). So you would need to somehow transfer
>>>>>>> this information to the workqueue.
>>>>>> Outside of technical challenges, the API also needs to return EAGAIN or
>>>>>> start blocking at some point. We can't expose a direct connection to
>>>>>> queue work like that, and let any user potentially create millions of
>>>>>> pending work items (and IOs).
>>>>> You wouldn't expect more concurrent events than the maxevents parameter
>>>>> that was supplied to io_setup syscall; it should have reserved any
>>>>> resources needed.
>>>> Doesn't matter what limit you apply, my point still stands - at some
>>>> point you have to return EAGAIN, or block. Returning EAGAIN without
>>>> the caller having flagged support for that change of behavior would
>>>> be problematic.
>>> Doesn't it already return EAGAIN (or some other error) if you exceed
>>> maxevents?
>> It's a setup thing. We check these limits when someone creates an IO
>> context, and carve out the specified entries from our global pool. Then
>> we free those "resources" when the io context is freed.
>>
>> Right now I can setup an IO context with 1000 entries on it, yet that
>> number has NO bearing on when io_submit() would potentially block or
>> return EAGAIN.
>>
>> We can have a huge gap on the intent signaled by io context setup, and
>> the reality imposed by what actually happens on the IO submission side.
>
> Isn't that a bug? Shouldn't that 1001st incomplete io_submit() return
> EAGAIN?
>
> Just tested it, and maxevents is not respected for this:
>
> io_setup(1, [0x7fc64537f000]) = 0
> io_submit(0x7fc64537f000, 10, [{pread, fildes=3, buf=0x1eb4000,
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread,
> fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3,
> buf=0x1eb4000, nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000,
> nbytes=4096, offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096,
> offset=0}, {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0},
> {pread, fildes=3, buf=0x1eb4000, nbytes=4096, offset=0}]) = 10
>
> which is unexpected, to me.

ioctx_alloc()
{
	[...]

	/*
	 * We keep track of the number of available ringbuffer slots, to prevent
	 * overflow (reqs_available), and we also use percpu counters for this.
	 *
	 * So since up to half the slots might be on other cpu's percpu counters
	 * and unavailable, double nr_events so userspace sees what they
	 * expected: additionally, we move req_batch slots to/from percpu
	 * counters at a time, so make sure that isn't 0:
	 */
	nr_events = max(nr_events, num_possible_cpus() * 4);
	nr_events *= 2;
}

-- 
Jens Axboe
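
To make the arithmetic in the quoted ioctx_alloc() snippet concrete, here is a minimal userspace sketch of just those two lines. The helper name effective_nr_events() and the possible_cpus parameter (standing in for num_possible_cpus()) are made up for illustration and are not kernel API. The snippet is only part of the context sizing (the elided [...] is not modelled), so the real limit on a given kernel may differ.

```c
/*
 * Hypothetical helper (not kernel code): mirrors only the two
 * nr_events lines quoted from ioctx_alloc() above.
 *
 *   nr_events = max(nr_events, num_possible_cpus() * 4);
 *   nr_events *= 2;
 */
unsigned long effective_nr_events(unsigned long nr_events,
				  unsigned long possible_cpus)
{
	/* Floor of 4 slots per possible CPU, so the percpu batch is never 0. */
	unsigned long floor = possible_cpus * 4;

	if (nr_events < floor)
		nr_events = floor;

	/* Doubled, since up to half the slots may sit on other CPUs' counters. */
	return nr_events * 2;
}
```

Under these assumptions, io_setup(1) on a machine with 4 possible CPUs ends up with max(1, 16) * 2 = 32 slots from this step alone, which would be consistent with the 10-entry io_submit() above succeeding.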