On 11/20/24 17:35, Darrick J. Wong wrote:
On Fri, Nov 15, 2024 at 06:04:01PM +0000, Matthew Wilcox wrote:
On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
of whether a particular request needs it or not, and the user will need
to zero them for each request.
The way we handled this in NVMe was to use a bit in the command that
was called (iirc) FUSED, which let you use two consecutive entries for
a single command.
Some variant on that could surely be used for io_uring. Perhaps a
special opcode that says "the real opcode is here, and this is a two-slot
command". Processing gets a little spicy when one slot is the last in
the buffer and the next is the first in the buffer, but that's a SMOP.
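For illustration only, the wrap-around case could be sketched roughly as
below. Everything here is hypothetical and is not io_uring or NVMe code:
the 64-byte struct slot, the SLOT_F_TWO flag, and fetch_cmd() are
made-up names, and the ring is assumed to be a power-of-two array
indexed by free-running head/tail counters. The second slot is simply
fetched through the same masked index, so "last slot, then first slot"
needs no special case in this model.

```c
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE  64u
#define SLOT_F_TWO 0x01u  /* hypothetical flag: command continues in next slot */

/* Hypothetical 64-byte ring entry; not the real io_uring SQE layout. */
struct slot {
    uint8_t opcode;
    uint8_t flags;
    uint8_t payload[SLOT_SIZE - 2];
};

/*
 * Copy the command starting at 'head' into 'out', which must hold
 * 2 * SLOT_SIZE bytes. The ring has (mask + 1) entries, a power of two,
 * and head/tail are free-running counters.
 *
 * Returns the number of slots consumed (1 or 2), or 0 if the command
 * has not been fully published yet.
 */
unsigned fetch_cmd(const struct slot *ring, uint32_t mask,
                   uint32_t head, uint32_t tail, uint8_t *out)
{
    uint32_t avail = tail - head;
    const struct slot *first;

    if (avail == 0)
        return 0;

    first = &ring[head & mask];
    memcpy(out, first, SLOT_SIZE);

    if (!(first->flags & SLOT_F_TWO)) {
        /* Single-slot command: present a zeroed second half. */
        memset(out + SLOT_SIZE, 0, SLOT_SIZE);
        return 1;
    }

    if (avail < 2)
        return 0;

    /* Masking the index handles the last-slot-to-first-slot wrap. */
    memcpy(out + SLOT_SIZE, &ring[(head + 1) & mask], SLOT_SIZE);
    return 2;
}
```

The cost of this approach is the copy into a contiguous buffer; handing
out a pointer into the ring instead would need the split handled by
every consumer.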
I like willy's suggestion -- what's the difficulty in having a SQE flag
that says "...and keep going into the next SQE"? I guess that
introduces the problem that you can no longer react to the observation
of 4 new SQEs by creating 4 new contexts to process those SQEs and throw
all 4 of them at background threads, since you don't know how many IOs
are there.
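To make that counting problem concrete, here is a rough sketch under
stated assumptions. SQE_F_CONT and count_requests() are hypothetical
names, not existing io_uring flags or functions, and the sketch assumes
every entry carries a valid flags byte, which would not hold if a
continuation slot were pure payload. The point is only that the number
of requests is no longer the number of new entries: the flags have to be
walked first.

```c
#include <stdint.h>

#define SQE_F_CONT 0x80u  /* hypothetical flag: request continues in next entry */

/*
 * Count the complete requests among 'nr_new' entries starting at 'head'.
 * 'flags' holds one flags byte per ring entry; the ring has (mask + 1)
 * entries, a power of two. A request ends at the first entry that does
 * not have SQE_F_CONT set. A trailing run of entries that all have the
 * flag set is an incomplete request and is not counted.
 */
unsigned count_requests(const uint8_t *flags, uint32_t mask,
                        uint32_t head, uint32_t nr_new)
{
    unsigned nr_reqs = 0;

    for (uint32_t i = 0; i < nr_new; i++) {
        if (!(flags[(head + i) & mask] & SQE_F_CONT))
            nr_reqs++;
    }
    return nr_reqs;
}
```

With flags of { 0, CONT, 0, CONT } in four new entries, this reports two
complete requests rather than four, so a consumer that sizes its batch
of contexts from the entry count alone would over-allocate.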
Some variation on "variable size SQE" was discussed back in the day
as an option instead of SQE128. I don't remember why it was refused
exactly, but I'd think it was exactly the "spicy" moment Matthew
mentioned, especially since nvme passthrough was spanning its payload
across both parts of the SQE.
I'm pretty sure I can find more than a couple of downsides: for it to
be truly generic you need a flag in each SQE, and finding a bit is not
that easy; it also adds some overhead for everyone else, even when the
extension is not needed. At the end of the day, the main concern is
how it's placed and not where specifically, SQE / user memory / etc.
--
Pavel Begunkov