On 1/25/20 10:54 PM, Andres Freund wrote:
> Hi,
>
> On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
>> On 22.01.20 at 10:26, Greg Kroah-Hartman wrote:
>>> From: Jens Axboe <axboe@xxxxxxxxx>
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Suggested-by: Stefan Metzmacher <metze@xxxxxxxxx>
>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
>>>
>>> ---
>>>  fs/io_uring.c |    6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>  		wake_up(&ctx->sqo_wait);
>>>  		submitted = to_submit;
>>>  	} else if (to_submit) {
>>> +		if (current->mm != ctx->sqo_mm ||
>>> +		    current_cred() != ctx->creds) {
>>> +			ret = -EPERM;
>>> +			goto out;
>>> +		}
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually too restrictive,
>> because it means applications like Samba won't
>> be able to use io_uring at all.
>
> Yea, I think it is too restrictive. In fact, it broke my WIP branch to
> make postgres use io_uring.
>
> Postgres uses a forked process model, with all sub-processes forked off
> one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
> memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
> has postmaster create a number of io_urings that the different
> processes can then use (with locking if necessary).
>
> In plenty of cases it's fairly important for performance not to
> require an additional context switch to initiate IO, so we cannot
> delegate submitting to an io_uring to a separate process. But it's not
> feasible to have one (or even two) urings for each process either: for
> one, that's just about guaranteed to bring us over the default
> RLIMIT_MEMLOCK limit, and requiring root-only config changes is not an
> option for many (nor user friendly).
>
> Not sharing queues makes it basically impossible to rely on io_uring
> ordering properties when operation interlock is needed. E.g. to
> guarantee that the journal is flushed before some data buffer can be
> written back, being able to make use of links and drains is great - but
> there's one journal for all processes. To be able to guarantee anything,
> all the interlocked writes need to go through one io_uring. I've not yet
> implemented this, so I don't have numbers, but I expect pretty
> significant savings.
>
> Not being able to share urings also makes it harder to resolve
> deadlocks:
>
> As we call into both library and user-defined code, we cannot guarantee
> that a specific backend process will promptly (or at all, when waiting
> for some locks) process cqes. There are also sections where we don't want
> to constantly check for ready events, for performance reasons. But
> operations initiated by a process might be blocking other connections:
>
> E.g. process #1 might have initiated transferring a number of blocks
> into postgres' buffer pool via io_uring, and now is busy processing the
> first block that completed.
> But now process #2 might need one of the buffers that had IO queued,
> but didn't complete in time for #1 to see the results. The way I have
> it set up right now, #2 can simply process pending cqes in the relevant
> queue. Which, in this example, would mark the pg buffer pool entry as
> valid, allowing #2 to continue.
>
> Now, completions can still be read by all processes, so I could continue
> to do the above: but that'd require all potentially needed io_urings to
> be set up in postmaster, before the first fork, and all processes to
> keep all those FDs open (commonly several hundred). Not an attractive
> option either, imo.
>
> Obviously we could solve this by having a sqe result processing thread
> running within each process - but that'd be a very significant new
> overhead. And it'd require making some non-threadsafe code threadsafe,
> which I do not relish tackling as a side effect of io_uring adoption.
>
> It also turns out to be nice from a performance / context-switch rate
> angle to be able to process cqes for submissions by other processes.
> It saves an expensive context switch, and often enough it really
> doesn't matter which process processes the completion (especially for
> readahead). And in other cases it's cheap to just schedule the
> subsequent work from the cqe processor, e.g. initiating readahead of a
> few more blocks into the pg buffer pool. Similarly, there are a few
> cases where it's useful for several processes to submit IO into a uring
> primarily drained by one specific process, to offload the subsequent
> action, if that's expensive.
>
> Now, I think there's a valid argument to be made that postgres should
> just use threads, and not be hampered by any of this. But a) that's not
> going to happen all that soon, as it's a large change, and b) it's far
> from free of problems either, especially scalability on larger machines,
> and robustness.
>
>> As even if current_cred() and ctx->creds describe the same
>> set of uids/gids, the != won't ever match again and
>> makes the whole ring unusable.
>
> Indeed. It also seems weird that sqpoll now basically has different
> semantics, allowing the io_uring to be used by multiple processes - a
> task with a different mm can still wake the sqpoll thread up, even.
>
> Since the different processes attached can still write to the
> io_uring mmaped memory, they can still queue sqes, they just can't
> initiate the processing. But the next time the creator of the uring
> submits, they will still be processed - and thus it seems that the
> kernel needs to handle this safely. So I really don't get what this
> actually achieves? Am I missing something here?

Thanks for your detailed report, I'm going to revert this change for 5.5.

-- 
Jens Axboe
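
[For readers following along: below is a minimal, illustrative sketch of
the fork-then-share pattern Andres describes, written against liburing.
It is not part of the original thread; the NOP request, names and error
handling are only for demonstration. The parent creates a ring and
forks; the child queues an sqe and submits on the shared ring. With
commit 44d282796f81 applied, the child's submit would be expected to
fail with EPERM, since after fork() the child's mm (and cred struct) no
longer match the ones recorded when the ring was created.]

/*
 * Sketch: parent creates an io_uring, child submits on it after fork().
 * Build with something like: gcc -o ring-fork ring-fork.c -luring
 */
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct io_uring ring;
	pid_t pid;
	int ret;

	/* parent sets up the ring; the SQ/CQ/SQE arrays are shared mmaps */
	ret = io_uring_queue_init(8, &ring, 0);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %s\n", strerror(-ret));
		return 1;
	}

	pid = fork();
	if (pid == 0) {
		/* child: queue an sqe and submit on the ring the parent created */
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_nop(sqe);
		ret = io_uring_submit(&ring);
		fprintf(stderr, "child submit: %s\n",
			ret < 0 ? strerror(-ret) : "ok");
		_exit(0);
	}

	waitpid(pid, NULL, 0);
	io_uring_queue_exit(&ring);
	return 0;
}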