On Fri, Dec 11, 2020 at 9:21 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> Explain, please. What's the difference between blocking in a lookup and
> blocking in truncate? Either your call site is fine with a potentially
> long sleep, or it is not; I don't understand what makes one source of
> that behaviour different from another.

So I'm not Jens, and I don't know exactly what io_uring loads he's looking at, but the reason I'm interested in this is that this is very much not the first time this has come up.

The big difference between filename lookup and truncate is that one is very common indeed, and the other isn't.

Sure, something like truncate happens. And it might even be a huge deal and very critical for some load. But realistically, I don't think I've ever seen a load where, if it's important and you can do it asynchronously, you couldn't just start a thread for it (particularly a kthread).

> "Fast path" in context like "we can't sleep here, but often enough we
> won't need to; here's a function that will bail out rather than blocking,
> let's call that and go through offload to helper thread in rare case
> when it does bail out" does make sense; what you are proposing to do
> here is rather different and AFAICS saying "that's my fast path" is
> meaningless here.

The "fast path" context here is not "we can't sleep here". No, the fast-path context here is "we want the highest performance here", with the understanding that there are other things to be done.

The existing code already simply starts a kernel thread for the open - not because it "can't sleep", but because of that "I want to get this operation started, but there are other things I want to start too".

And in that context, it's not about "can't sleep". It's about "if we already have the data in a very fast cache, then doing this asynchronously with a thread is SLOWER than just doing it directly".

In particular, it's not about correctness: doing it synchronously or asynchronously are both "equally correct". 
You get the same answer in the end. It's purely about that "if we can do it really quickly, it's better to just do it".

Which gets me back to the first part: this has come up before. Tux used to want to do _exactly_ this same thing. But what has happened is that

 (a) we now have an RCU lookup that is an almost exact match for this

and

 (b) we now have a generic interface for user space to use it in the form of io_uring

So this is not about "you have to get it right". In fact, if it was, the RCU lookup model would be the wrong thing, because the RCU name lookup is optimistic, and will fail for a variety of reasons.

No, this is literally about "threads and synchronization are a real overhead, so if you care about performance, you don't actually want to use them if you can do the operation so quickly that the thread and synchronization overhead is a real issue". Which is why LOOKUP_RCU is such a good match.

And while Tux was never very successful because it was so limited and so special, io_uring really looks like it could be the interface to make a lot of performance-sensitive people happy. And getting that "low-latency cached behaviour vs bigger operations that might need lots of locks or IO" balance right would be a very good thing, I suspect.

           Linus