Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write

Pavel Begunkov <asml.silence@xxxxxxxxx> · Tue, 12 Nov 2024 00:54:32 +0000

On 11/10/24 17:41, Kanchan Joshi wrote:
On 11/7/2024 10:53 PM, Pavel Begunkov wrote:

1. SQE128 makes it big for all requests, intermixing with requests that
don't need additional space wastes space. SQE128 is fine to use but at
the same time we should be mindful about it and try to avoid enabling it
if feasible.

Right. And initial versions of this series did not use SQE128. But as we
moved towards passing more comprehensive PI information, first SQE was
not enough. And we thought to make use of SQE128 rather than taking
copy_from_user cost.

Do we have any data how expensive it is? I don't think I've ever
tried to profile it. And where the overhead comes from? speculation
prevention?

We did measure this for nvme passthru commands in past (and that was the
motivation for building SQE128). Perf profile showed about 3% overhead
for copy [*].

Interesting. Sounds like the 3% is not accounting spec barriers,
and then I'm a bit curious how much of it comes from the generic
memcpy what could've been several 64 bit reads. But regardless
let's assume it is expensive.

If it's indeed costly, we can add sth to io_uring like pre-mapping
memory to optimise it, which would be useful in other places as
well.

But why to operate as if SQE128 does not exist?
Reads/Writes, at this point, are clearly not using aboud 20b in first
SQE and entire second SQE. Not using second SQE at all does not seem
like the best way to protect it from being used by future users.

You missed the point, if you take another look at the rest of my
reply I even mentioned that SQE128 could be used as an optimisation
and the only mode for this patchset, but the API has to be nicely
extendable with more attributes in the future.

You can't fit everything into SQE128. Even if we grow the SQE size
further, it's one size for all requests, mixing requests would mean
initilising entire SQE256/512/... for all requests, even for those
that don't need it. It might be reasonable for some applications
but not for a generic case.

I know you care about having that particular integrity feature,
but it'd be bad for io_uring to lock into a suboptimal API and
special-casing PI implementation. Let's shift a discussion about
details to the other sub-thread.

Pre-mapping maybe better for opcodes for which copy_for_user has already
been done. For something new (like this), why to start in a suboptimal
way, and later, put the burden of taking hoops on userspace to get to
the same level where it can get by simply passing a flag at the time of
ring setup.

--
Pavel Begunkov