Re: [PATCH] pipe: fix hang when racing with a wakeup

On 10/1/20 1:28 PM, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 8:58 AM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
>>
>> The problem is we're doing the prepare_to_wait, which sets our state
>> each time; however, we can be woken up by either reads or writes.  In
>> the case above we race with the writer waking us up, re-set our
>> state to TASK_INTERRUPTIBLE, and thus never break out of schedule().

> Good catch of an interesting problem.
>
> That said, the real problem here is that "pipe_wait()" is some nasty
> garbage. I really wanted to convert all users to do a proper
> wait-queue usage, but left the old structure alone.
>
> Any normal "prepare_to_wait()" user should always check for the
> condition that it's waiting for _after_ the prepare-to-wait call, but
> the pipe code was traditionally all run under the pipe mutex, so it
> should have no races at all, because it's completely invalid to use
> "pipe_wait()" with anything that doesn't hold the mutex (both on the
> sleeping and waking side).
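For reference, the normal prepare_to_wait() idiom re-checks the condition between arming the wait and sleeping, so a wakeup that fires in between is not lost. A rough sketch in kernel-style pseudocode (not the actual pipe code; the waitqueue and condition are illustrative):

```c
/* Sketch of the usual kernel wait idiom, not the actual pipe code:
 * the condition is re-checked after prepare_to_wait() has set our
 * task state, so a wakeup that races in between simply makes the
 * following schedule() return immediately instead of sleeping. */
DEFINE_WAIT(wait);

for (;;) {
	prepare_to_wait(&pipe->wr_wait, &wait, TASK_INTERRUPTIBLE);
	if (!pipe_full(pipe->head, pipe->tail, pipe->max_usage))
		break;		/* condition already satisfied */
	schedule();		/* safe: a racing wakeup set us RUNNING */
}
finish_wait(&pipe->wr_wait, &wait);
```

pipe_wait() can't follow this discipline because it doesn't know which condition its caller is waiting for.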

> So pipe_wait() is kind of legacy and meant for all the silly and
> complex UNIX domain connection things that nobody wants to touch.
>
> The IO code was supposed to have been converted away from that pipe
> mutex model, but it looks like I punted on splice, without thinking
> about this issue.
>
> So I think the *real* fix would be to convert the splice waiting code
> to work even without holding the pipe mutex. Because honestly, I think
> your patch fixes the problem, but not completely.
>
> In particular, the pipe wakeup can happen from somebody that doesn't
> hold the pipe mutex at all (see pipe_read(), and notice how it does
> the __pipe_unlock() before the "if it was full, wake things up"
> part), so this whole sequence is racy:
>
>     check if pipe is full
>     pipe_wait() if it is
>
> because the wakeup can happen in between those things, and
> "pipe_wait()" has no way to know what it's waiting for, and the wakeup
> event has already happened by the time it's called.

Yes, sorry, that's what I meant to convey in my changelog: the wakeup has to happen outside the pipe_lock for this race to occur.

Generally I think this is the only real problem we have, because, as you said, we only modify the tail/head under the pipe_lock, so they're not going to change while we do our pipe_wait(). The problem happens when we get woken for a case that no longer applies to us. My fix "fixes" the problem in that we'll get a spurious wakeup, see that our condition isn't met, and go back to sleep, and then the next writer will wake us up for real.

Obviously not ideal, but I figured the simpler fix was better for stable, and then we could work out something better.


> Let me think about this, but I think the right solution is to just get
> rid of the splice use of pipe_wait.


Yeah, we could just have the callers wait on the waitqueue they actually care about; that would be a simple solution that would also be cleaner.

> Do you have some half-way reliable way to trigger the issue for testing?

I do not, but I also didn't try. We basically just have to break out of the pipe_write() loop before waking anything up. I'll rig up something that writes one page at a time while the other side splices into /dev/null, run that in a loop for a few minutes, and see if I can trigger it more reliably than my btrfs send case. Thanks,

Josef


