I hit a hang with fstest btrfs/187, which does a btrfs send into /dev/null. This works by creating a pipe: the write side is given to the kernel to write into, and the read side is handed to a thread that splices into a file, in this case /dev/null.

The box that was hung had the write side stuck here

[<0>] pipe_write+0x38b/0x600
[<0>] new_sync_write+0x108/0x180
[<0>] __kernel_write+0xd4/0x160
[<0>] kernel_write+0x74/0x110
[<0>] btrfs_ioctl_send+0xb51/0x1867
[<0>] _btrfs_ioctl_send+0xbf/0x100
[<0>] btrfs_ioctl+0x1d4c/0x3090
[<0>] __x64_sys_ioctl+0x83/0xb0
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

and the read side stuck here

[<0>] pipe_wait+0xa4/0x100
[<0>] splice_from_pipe_next.part.0+0x33/0xe0
[<0>] __splice_from_pipe+0x6a/0x200
[<0>] splice_from_pipe+0x50/0x70
[<0>] do_splice+0x35c/0x7e0
[<0>] __x64_sys_splice+0x92/0x110
[<0>] do_syscall_64+0x33/0x40
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using drgn to debug, I found that the pipe's rd_wait was empty, but the wr_wait was not

>>> pipe.rd_wait.head
(struct list_head){
	.next = (struct list_head *)0xffff9d2e5a3ad8d0,
	.prev = (struct list_head *)0xffff9d2e5a3ad8d0,
}
>>> pipe.wr_wait.head
(struct list_head){
	.next = (struct list_head *)0xffffa8934111bd88,
	.prev = (struct list_head *)0xffffa89341143ba8,
}
>>> for e in list_for_each_entry('struct wait_queue_entry', pipe.wr_wait.head.address_of_(), 'entry'):
...     task = Object(prog, 'struct task_struct', address=e.private.value_())
...     print("pid {} state {}".format(task.pid, task.state))
...
pid (pid_t)3080640 state (volatile long)1
pid (pid_t)3080638 state (volatile long)1
>>> for e in list_for_each_entry('struct wait_queue_entry', pipe.rd_wait.head.address_of_(), 'entry'):
...     task = Object(prog, 'struct task_struct', address=e.private.value_())
...     print("pid {} state {}".format(task.pid, task.state))
...
The wr_wait has both the writer and the reader waiting, which is expected since the pipe is full

>>> pipe.head
(unsigned int)179612
>>> pipe.tail
(unsigned int)179596
>>> pipe.max_usage
(unsigned int)16

but the read side is only waiting on wr_wait; rd_wait doesn't have our entry any more. This can happen in the following way

WRITER					READER

pipe_write()				splice
  pipe_lock()				  pipe_lock()
  was_empty = true
  do the write
  break out of the loop
  pipe_unlock()
					  consume what was written
					  pipe_wait()
					    prepare_to_wait(rd_wait)
					      set_task_state(INTERRUPTIBLE)
  wake_up(rd_wait)
    set_task_state(READER, RUNNING)
					    prepare_to_wait(wr_wait)
					      set_task_state(INTERRUPTIBLE)
					    pipe_unlock()
					    schedule()

The problem is that prepare_to_wait() sets our task state each time it is called, but we can be woken up by either readers or writers. In the case above we race with the WRITER waking us up: the second prepare_to_wait() re-sets our state to TASK_INTERRUPTIBLE, so we never break out of schedule(). Instead we need to set our state once, and then add ourselves to our respective wait queues. This way, if either of our wait queues wakes us up, we'll have TASK_RUNNING set when we enter schedule() and will go right back to check for whatever it is we're waiting on.

I tested this patch with the test that hung, but this only happened on one VM last night; these tests run twice on 3 VMs every night and this is the first time I've hit the problem, so it's a rare occurrence.
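To make the lost-wakeup window concrete, here is a minimal, purely illustrative Python replay of the interleaving in the diagram above (the state constants and function names are stand-ins for the kernel's task states, not kernel code). It shows why the buggy ordering enters schedule() with TASK_INTERRUPTIBLE set, while the fixed ordering enters it with TASK_RUNNING:

```python
# Illustrative stand-ins for the kernel task states (values are arbitrary).
TASK_RUNNING = 0
TASK_INTERRUPTIBLE = 1

def buggy_reader_steps():
    """Replay the racy ordering: prepare_to_wait() on each queue
    re-sets the task state, clobbering the writer's wakeup."""
    state = TASK_INTERRUPTIBLE   # reader: prepare_to_wait(rd_wait)
    state = TASK_RUNNING         # writer: wake_up(rd_wait) lands here
    state = TASK_INTERRUPTIBLE   # reader: prepare_to_wait(wr_wait) re-sets state
    return state                 # schedule() sees INTERRUPTIBLE -> sleeps forever

def fixed_reader_steps():
    """Replay the fixed ordering: set the state once, then only add the
    wait queue entries (add_wait_queue() does not touch the task state)."""
    state = TASK_INTERRUPTIBLE   # reader: set_current_state() once,
                                 # then add_wait_queue(rd_wait) and
                                 # add_wait_queue(wr_wait) follow
    state = TASK_RUNNING         # writer: wake_up() on either queue
    return state                 # schedule() sees RUNNING -> returns at once

print(buggy_reader_steps())  # 1: the wakeup was lost, the reader hangs
print(fixed_reader_steps())  # 0: the wakeup sticks, schedule() returns
```

This is only a sequential sketch of the state transitions, not a concurrent reproduction, but it captures why the order of set_current_state() relative to the queue additions matters.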
Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
---
 fs/pipe.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 60dbee457143..8803a11cbc1b 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -116,8 +116,9 @@ void pipe_wait(struct pipe_inode_info *pipe)
 	 * Pipes are system-local resources, so sleeping on them
 	 * is considered a noninteractive wait:
 	 */
-	prepare_to_wait(&pipe->rd_wait, &rdwait, TASK_INTERRUPTIBLE);
-	prepare_to_wait(&pipe->wr_wait, &wrwait, TASK_INTERRUPTIBLE);
+	set_current_state(TASK_INTERRUPTIBLE);
+	add_wait_queue(&pipe->rd_wait, &rdwait);
+	add_wait_queue(&pipe->wr_wait, &wrwait);
 	pipe_unlock(pipe);
 	schedule();
 	finish_wait(&pipe->rd_wait, &rdwait);
-- 
2.26.2