Re: [PATCH RFC v2 00/19] fuse: fuse-over-io-uring

Jens Axboe <axboe@xxxxxxxxx> · Thu, 30 May 2024 18:11:35 -0600

On 5/30/24 1:35 PM, Kent Overstreet wrote:
> On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote:
>> On 5/30/24 11:58 AM, Kent Overstreet wrote:
>>> On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote:
>>>> I have addressed it several times in the past. tldr is that yeah the
>>>> initial history of io_uring wasn't great, due to some unfortunate
>>>> initial design choices (mostly around async worker setup and
>>>> identities).
>>>
>>> Not to pick on you too much but the initial history looked pretty messy
>>> to me - a lot of layering violations - it made aio.c look clean.
>>
>> Oh I certainly agree, the initial code was in a much worse state than it
>> is in now. Lots of things have happened there, like splitting things up
>> and adding appropriate layering. That was more of a code hygiene kind of
>> thing, to make it easier to understand, maintain, and develop.
>>
>> Any new subsystem is going to see lots of initial churn, regardless of
>> how long it's been developed before going into upstream. We certainly
>> had lots of churn, where these days it's stabilized. I don't think
>> that's unusual, particularly for something that attempts to do certain
>> things very differently. I would've loved to start with our current
>> state, but I don't think years of being out of tree would've completely
>> solved that. Some things you just don't find until it's in tree,
>> unfortunately.
> 
> Well, the main thing I would've liked is a bit more discussion in the
> early days of io_uring; there are things we could've done differently
> back then that could've got us something cleaner in the long run.

No matter how much discussion would've been had, there always would've
been compromises or realizations that "yeah that thing there should've
been done differently". In any case, pointless to argue about that, as
the only thing we can change is how things look in the present and
future. At least I don't have a time machine.

> My main complaints were always
>  - yet another special purpose ringbuffer, and
>  - yet another parallel syscall interface.

Exactly how many "parallel syscall interfaces" do we have?

> We've got too many of those too (aio is another), and the API

Like which ones? aio is a special case async interface for O_DIRECT IO,
that's about it. It's not a generic IO interface. It's literally dio
only. And yes, then it has the option of syncing a file, and poll got
added some years ago as well. But for the longest duration of aio, it
was just dio aio. The early versions of io_uring actually added on top of
that, but I didn't feel like it belonged there.

> fragmentation is a real problem for userspace that just wants to be able
> to issue arbitrary syscalls asynchronously. IO uring could've just been
> serializing syscall numbers and arguments - that would have worked fine.

That doesn't work at all. If all syscalls had been designed with
issue + wait semantics, then yeah that would obviously be the way that
it would've been done. You just add all of them, and pass arguments,
done. But that's not reality. You can do that if you just offload to a
worker thread, but that's not how you get performance. And you could
very much STILL do just that, in fact it'd be trivial to wire up. But
nobody would use it, because something that just always punts to a
thread would be awful for performance. You may as well just do that
offload in userspace then.

Hence the majority of the work for wiring up operations that implement
existing functionality in an async way is core work. The io_uring
interface for it is the simplest part, once you have the underpinnings
doing what you want. Sometimes that's some ugly "this can't block, if it
does, return -EAGAIN", and sometimes it's refactoring things a bit so
that you can tap into the inevitable waitqueue. There's no single
recipe, it all depends on how it currently works.

> Given the history of failed AIO replacements just wanting to shove in
> something working was understandable, but I don't think those would have
> been big asks.

What are these failed AIO replacements? aio is for storage, io_uring was
never meant to be a storage only interface. The only other attempt I can
recall, back in the day, was the acall and threadlet stuff that Ingo and
zab worked on. And even that attempted to support async in a performant
way, by doing work inline whenever possible. But hard to use, as you'd
return as a different pid if the original task blocked.

>>> I'd also really like to see some more standardized mechanisms for "I'm a
>>> kernel thread doing work on behalf of some other user thread" - this
>>> comes up elsewhere, I'm talking with David Howells right now about
>>> fsconfig which is another place it is or will be coming up.
>>
>> That does exist, and it came from the io_uring side of needing exactly
>> that. This is why we have create_io_thread(). IMHO it's the only sane
>> way to do it, trying to guesstimate what happens deep down in a random
>> callstack, and setting things up appropriately, is impossible. This is
>> where most of the earlier day io_uring issues came from, and what I
>> referred to as a "poor initial design choice".
> 
> Thanks, I wasn't aware of that - that's worth highlighting. I may switch
> thread_with_file to that, and the fsconfig work David and I are talking
> about can probably use it as well.
> 
> We really should have something lighter weight that we can use for work
> items though, that's our standard mechanism for deferred work, not
> spinning up a whole kthread. We do have kthread_use_mm() - there's no
> reason we couldn't do an expanded version of that for all the other
> shared resources that need to be available.

Like io-wq does for io_uring? That's why it's there. io_uring tries not
to rely on it very much, it's considered the slow path for the above
mentioned reasons of why thread offload generally isn't a great idea.
But at least it doesn't require a full fork for each item.

> This was also another blocker in the other aborted AIO replacements, so
> reusing clone flags does seem like the most reasonable way to make
> progress here, but I would wager there's other stuff in task_struct that
> really should be shared and isn't. task_struct is up to 825 (!) lines
> now, which means good luck on even finding that stuff - maybe at some
> point we'll have to get a giant effort going to clean up and organize
> task_struct, like willy's been doing with struct page.

Well that thing is an unwieldy beast and has been for many years. So
yeah, very much agree that it needs some tender love and care, and we'd
be better off for it.

>>>> Those have since been rectified, and the code base is
>>>> stable and solid these days.
>>>
>>> good tests, code coverage analysis to verify, good syzbot coverage?
>>
>> 3x yes. Obviously I'm always going to say that tests could be better,
>> have better coverage, cover more things, because nothing is perfect (and
>> if you think it is, you're fooling yourself) and as a maintainer I want
>> perfect coverage. But we're pretty diligent these days about adding
>> tests for everything. And any regression or bug report always gets test
>> cases written.
> 
> *nod* that's encouraging. Looking forward to the umount issue being
> fixed so I can re-enable it in my tests...

I'll pick it up again soon enough, I'll let you know.

-- 
Jens Axboe