Hi,

On 2020-02-01 14:30:06 +0300, Pavel Begunkov wrote:
> On 01/02/2020 12:18, Andres Freund wrote:
> > Hi,
> >
> > Reading the manpage from liburing I read:
> >
> > IOSQE_IO_LINK
> >     When this flag is specified, it forms a link with the next SQE in
> >     the submission ring. That next SQE will not be started before this
> >     one completes. This, in effect, forms a chain of SQEs, which can
> >     be arbitrarily long. The tail of the chain is denoted by the first
> >     SQE that does not have this flag set. This flag has no effect on
> >     previous SQE submissions, nor does it impact SQEs that are outside
> >     of the chain tail. This means that multiple chains can be
> >     executing in parallel, or chains and individual SQEs. Only members
> >     inside the chain are serialized. Available since 5.3.
> >
> > IOSQE_IO_HARDLINK
> >     Like IOSQE_IO_LINK, but it doesn't sever regardless of the
> >     completion result. Note that the link will still sever if we fail
> >     submitting the parent request, hard links are only resilient in
> >     the presence of completion results for requests that did submit
> >     correctly. IOSQE_IO_HARDLINK implies IOSQE_IO_LINK. Available
> >     since 5.5.
> >
> > I can make some sense out of that description of IOSQE_IO_LINK
> > without looking at kernel code. But I don't think it's possible to
> > understand what happens when an earlier chain member fails, and what
> > denotes an error. IOSQE_IO_HARDLINK's description kind of implies
> > that IOSQE_IO_LINK will not start the next request if there was a
> > failure, but doesn't define failure either.
>
> Right, after a "failure" occurred for a IOSQE_IO_LINK request, all
> subsequent requests in the link won't be executed, but completed with
> -ECANCELED. However, if IOSQE_IO_HARDLINK set for the request, it won't
> sever/break the link and will continue to the next one.

I think something along those lines should be added to the manpage...
I think severing the link isn't really a good description, because it's
not like it's separating off the tail to be independent, or such. If
anything it's the opposite.

> > Looks like it's defined in a somewhat adhoc manner. For file
> > read/write subsequent requests are failed if they are a short
> > read/write. But e.g. for sendmsg that looks not to be the case.
>
> As you said, it's defined rather sporadically. We should unify for it
> to make sense. I'd prefer to follow the read/write pattern.

I think one problem with that is that it's not necessarily useful to
insist on the length being the maximum allowed length. E.g. for a
recvmsg you'd likely want to not fail the request if you read less than
what you provided for, because that's just a normal occurrence. It could
e.g. be useful to just start the next recv (with a different buffer)
immediately.

I'm not even sure it's generally sensible for read either, as that
doesn't work well for EOF, non-file FDs, ... Perhaps there's just no
good solution though.

> > Perhaps it'd make sense to reject use of IOSQE_IO_LINK outside ops
> > where it's meaningful?
>
> If we disregard it for either length-based operations or the rest ones
> (or whatever combination), the feature won't be flexible enough to be
> useful, but in combination it allows to remove much of context
> switches.

I really don't want to make it less useful ;) - in fact I'm pretty
excited about having it. I haven't yet implemented / benchmarked that,
but I think for databases it is likely to be very useful for achieving
low but consistent IO queue depths for background tasks like
checkpointing, readahead, writeback etc, while still having a low
context switch rate.

Without something like IOSQE_IO_LINK it's considerably harder to have
continuous IO that doesn't impact higher priority IO like journal
flushes.

Andres Freund