Re: [LSF/MM/BPF TOPIC] Replacing TASK_(UN)INTERRUPTIBLE with regions of uninterruptibility

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sat, 3 Feb 2024 12:27:26 -0500

On Fri, Feb 02, 2024 at 04:23:46PM +0000, Al Viro wrote:
> On Fri, Feb 02, 2024 at 11:22:15AM +0000, David Howells wrote:
> > Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> > 
> > > Just making inode_lock() interruptible would break everything.
> > 
> > Why?  Obviously, you'd need to check the result of the inode_lock(), which I
> > didn't put in my very rough example code, but why would taking the lock at the
> > front of a vfs op like mkdir be a problem?
> 
> Plenty of new failure exits to maintain?

I don't currently see a reason to go around converting existing
uninterruptible sleeps; the main benefit of the proposal as I see it
would be that we could mark sleeps as either interruptible or killable
correctly, since that really depends on what syscall we're in and what
userspace is expecting. If kernel code can correctly do one it can do
both, so this is a pretty straightforward change.

But it is an interesting idea; I'd be curious to see what comes out of
playing around with some refactorings.

There are some other wait_event()-related ideas kicking around too...

Willy and Dave and I were talking about the "asynchronous waits" that
io_uring is wanting to do - I believe this is currently just done in an
ad-hoc way for waiting on a folio lock.

It seemed like it might be possible to do this in a more generic way by
simply dynamically allocating the waitlist entry, and signalling via
task_struct that the wait/wakeup should be delivered to a kiocb instead
of to a thread.
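
To make the shape of that concrete, here is a rough userspace toy, not
kernel code. Every name in it (toy_task, toy_kiocb, toy_wait() and so
on) is made up for illustration, and the blocking path is deliberately
left out: the point is only that a per-task hint picks the wakeup
target, and that the wait entry is heap-allocated so it can outlive the
call that queued it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these types exist in the kernel. */
struct toy_kiocb { bool completed; };		/* plays the role of a kiocb */
struct toy_task { struct toy_kiocb *async_iocb; };	/* hint kept per task */

struct toy_wait_entry {
	struct toy_wait_entry *next;
	struct toy_kiocb *iocb;		/* who gets the wakeup */
};

struct toy_waitq { struct toy_wait_entry *head; };

enum { TOY_QUEUED = 1, TOY_WOULD_SLEEP = 2, TOY_NOMEM = -1 };

/*
 * If the task carries an async target, allocate a wait entry, queue it
 * and return immediately; the wakeup is later delivered to the iocb.
 * Otherwise report that the caller would have to sleep (not modelled).
 */
static int toy_wait(struct toy_waitq *wq, const struct toy_task *task)
{
	struct toy_wait_entry *e;

	assert(wq && task);
	if (!task->async_iocb)
		return TOY_WOULD_SLEEP;

	e = malloc(sizeof(*e));
	if (!e)
		return TOY_NOMEM;
	e->iocb = task->async_iocb;
	e->next = wq->head;
	wq->head = e;
	return TOY_QUEUED;
}

/* Deliver the wakeup to every queued iocb and free the entries. */
static int toy_wake_up_all(struct toy_waitq *wq)
{
	struct toy_wait_entry *e = wq->head;
	int woken = 0;

	wq->head = NULL;
	while (e) {
		struct toy_wait_entry *next = e->next;

		e->iocb->completed = true;
		free(e);
		woken++;
		e = next;
	}
	return woken;
}
```

A caller with task.async_iocb set gets TOY_QUEUED back right away, and
the iocb is marked completed once toy_wake_up_all() runs.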

Another thing I've been wanting to do is embed a sequence number in
wait_queue_head_t, which would be incremented on wakeup. This would
change prepare_to_wait() to "read current sequence number", then later
we sleep until the sequence number has changed from what we initially
read.

This would let us fix double expansion of the wait condition in the
wait_event() macros, and it would also mean we're not flipping task
state before running the cond expression...
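
A minimal userspace sketch of the sequence number idea, assuming a
pthread mutex and condition variable as stand-ins for the waitqueue
lock and the scheduler. All of the seq_* names are hypothetical and
this is not how the kernel's wait_event() is written today; it only
shows that "prepare" becomes a plain read of the counter, that the
condition is evaluated at a single site, and that a wakeup arriving
between the read and the sleep is not lost.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a wait_queue_head_t with a sequence number. */
struct seq_waitq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	unsigned long seq;	/* incremented on every wakeup */
};

static void seq_waitq_init(struct seq_waitq *wq)
{
	pthread_mutex_init(&wq->lock, NULL);
	pthread_cond_init(&wq->cond, NULL);
	wq->seq = 0;
}

/* "prepare_to_wait": just read the current sequence number. */
static unsigned long seq_prepare_to_wait(struct seq_waitq *wq)
{
	unsigned long seq;

	pthread_mutex_lock(&wq->lock);
	seq = wq->seq;
	pthread_mutex_unlock(&wq->lock);
	return seq;
}

/* Sleep until the sequence number differs from the one we read. */
static void seq_wait(struct seq_waitq *wq, unsigned long seq)
{
	pthread_mutex_lock(&wq->lock);
	while (wq->seq == seq)
		pthread_cond_wait(&wq->cond, &wq->lock);
	pthread_mutex_unlock(&wq->lock);
}

static void seq_wake_up(struct seq_waitq *wq)
{
	pthread_mutex_lock(&wq->lock);
	wq->seq++;
	pthread_cond_broadcast(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
}

/* wait_event() analogue: cond is called from exactly one place, and
 * always runs before any sleeping state has been entered. */
static void seq_wait_event(struct seq_waitq *wq,
			   bool (*cond)(void *), void *arg)
{
	for (;;) {
		unsigned long seq = seq_prepare_to_wait(wq);

		if (cond(arg))
			break;
		seq_wait(wq, seq);
	}
}

struct seq_demo { struct seq_waitq wq; atomic_int ready; };

static bool seq_demo_ready(void *arg)
{
	return atomic_load((atomic_int *)arg) != 0;
}

static void *seq_demo_waker(void *p)
{
	struct seq_demo *d = p;

	atomic_store(&d->ready, 1);	/* make the condition true... */
	seq_wake_up(&d->wq);		/* ...then bump seq and wake */
	return NULL;
}

/* Returns the final value of the flag once the waiter has been woken. */
static int seq_waitq_demo(void)
{
	struct seq_demo d;
	pthread_t t;
	int err;

	seq_waitq_init(&d.wq);
	atomic_init(&d.ready, 0);
	err = pthread_create(&t, NULL, seq_demo_waker, &d);
	assert(err == 0);
	(void)err;
	seq_wait_event(&d.wq, seq_demo_ready, &d.ready);
	pthread_join(t, NULL);
	return atomic_load(&d.ready);
}
```

Calling seq_waitq_demo() from a main() built with -pthread exercises
both orderings of waker and waiter: if the wakeup lands after the
sequence number was read but before the sleep, seq_wait() sees the
changed counter and returns without blocking.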