On 6/1/22 12:52 PM, Linus Torvalds wrote: > On Wed, Jun 1, 2022 at 11:34 AM Jens Axboe <axboe@xxxxxxxxx> wrote: >> >> But as a first step, let's just mark it deprecated with a pr_warn() for >> 5.20 and then plan to kill it off whenever a suitable amount of relases >> have passed since that addition. > > I'd love to, but it's not actually realistic as things stand now. > epoll() is used in a *lot* of random libraries. A "pr_warn()" would > just be senseless noise, I bet. I mean only for the IORING_OP_EPOLL_CTL opcode, which is the only epoll connection we have in there. It'd be jumping the gun to do it for the epoll_ctl syscall for sure... And I really have no personal skin in that game, other than having a better alternative. But that's obviously a long pole type of deprecation. > No, there's a reason that EPOLL is still there, still 'default y', > even though I dislike it and think it was a mistake, and we've had > several nasty bugs related to it over the years. > > It really can be a very useful system call, it's just that it really > doesn't work the way the actual ->poll() interface was designed, and > it kind of hijacks it in ways that mostly work, but the have subtle > lifetime issues that you don't see with a regular select/poll because > those will always tear down the wait queues. > > Realistically, the proper fix to epoll is likely to make it explicit, > and make files and drivers that want to support it have to actually > opt in. Because a lot of the problems have been due to epoll() looking > *exactly* like a regular poll/select to a driver or a filesystem, but > having those very subtle extended requirements. > > (And no, the extended requirements aren't generally onerous, and > regular ->poll() works fine for 99% of all cases. It's just that > occasionally, special users are then fooled about special contexts). It's not an uncommon approach to make the initial adoption / implementation more palatable, though commonly then also ends up being a mistake. I've certainly been guilty of that myself too... > In other words, it's a bit like our bad old days when "splice()" ended > up falling back to regular ->read()/->write() implementations with > set_fs(KERNEL_DS). Yes, that worked fine for 99% of all cases, and we > did it for years, but it also caused several really nasty issues for > when the read/write actor did something slightly unusual. Unfortunately that particular change I just had to deal with, and noticed that we're up to more than two handfuls of fixes for that and I bet we're not done. Not saying it wasn't the right choice in terms of sanity, but it has been more painful than I thought it would be. > So I may dislike epoll quite intensely, but I don't think we can > *really* get rid of it. But we might be able to make it a bit more > controlled. > > But so far every time it has caused issues, we've worked around it by > fixing it up in the particular driver or whatever that ended up being > triggered by epoll semantics. The io_uring side of the epoll management I'm very sure can go in a few releases, and a pr_warn_once() for 5.20 is the right choice. epoll itself, probably not even down the line, though I am hoping we can continue to move people off of it. Maybe in another 20 years :-) -- Jens Axboe