Re: [PATCH RFC 9/9] io_uring: Introduce IORING_OP_EXEC command

Josh Triplett <josh@xxxxxxxxxxxxxxxx> · Tue, 10 Dec 2024 13:01:57 -0800

On Mon, Dec 09, 2024 at 06:43:11PM -0500, Gabriel Krisman Bertazi wrote:
> From: Josh Triplett <josh@xxxxxxxxxxxxxxxx>
> 
> This command executes the equivalent of an execveat(2) in a previously
> spawned io_uring context, causing the execution to return to a new
> program indicated by the SQE.
> 
> As an io_uring command, it is special in a few ways, requiring some
> quirks. First, it can only be executed from the spawned context linked
> after the IORING_OP_CLONE command; In addition, the first successful
> IORING_OP_EXEC command will terminate the link chain, causing
> further operations to fail with -ECANCELED.
> 
> There are a few reason for the first limitation: First, it wouldn't make
> much sense to execute IORING_OP_EXEC in an io-wq, as it would simply
> mean "stealing" the worker thread from io_uring; It would also be
> questionable to execute inline or in a task work, as it would terminate
> the execution of the ring.  Another technical reason is that we'd
> immediately deadlock (fixable), because we'd need to complete the
> command and release the reference after returning from the execve, but
> the context has already been invalidated by terminating the process.
> All in all, considering io_uring's purpose to provide an asynchronous
> interface, I'd (Gabriel) like to focus on the simple use-case first,
> limiting it to the cloned context for now.

This seems like a reasonable limitation for now. I'd eventually like to
handle things like "install these fds, do some other setup calls, then
execveat" as a ring submission (perhaps as a synchronous one), but
leaving that out for now seems reasonable.

The combination of clone and exec should probably get advertised as a
new capability. If we add exec-without-clone in the future, that can be
a second new capability.

The commit message should probably also document the rationale for dfd
not accepting a ring index (for now) rather than an installed fd. That
*also* seems like a perfectly reasonable limitation for now, just one
that needs documenting.

Otherwise, LGTM, and thank you again for updating this!