On 17. 03. 23, 3:41, juanfengpy@xxxxxxxxx wrote:
From: Hui Li <caelli@xxxxxxxxxxx>
We have met a hang on pty device, the reader was blocking
at epoll on master side, the writer was sleeping at wait_woken
inside n_tty_write on slave side, and the write buffer on
tty_port was full, we found that the reader and writer would
never be woken again and blocked forever.
The problem was caused by a race between reader and kworker:
n_tty_read(reader): n_tty_receive_buf_common(kworker):
copy_from_read_buf()|
|room = N_TTY_BUF_SIZE - (ldata->read_head - tail)
|room <= 0
n_tty_kick_worker() |
|ldata->no_room = true
After writing to slave device, writer wakes up kworker to flush
data on tty_port to reader, and the kworker finds that reader
has no room to store data so room <= 0 is met. At this moment,
reader consumes all the data on reader buffer and calls
n_tty_kick_worker to check ldata->no_room which is false and
reader quits reading. Then kworker sets ldata->no_room=true
and quits too.
...
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
...
@@ -1729,6 +1729,27 @@ n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
} else
n_tty_check_throttle(tty);
+ if (unlikely(ldata->no_room)) {
+ /*
+ * Barrier here is to ensure to read the latest read_tail in
+ * chars_in_buffer() and to make sure that read_tail is not loaded
+ * before ldata->no_room is set,
I am not sure I would keep the following part of the comment in the code:
> otherwise, following race may occur:
+ * n_tty_receive_buf_common()
+ * n_tty_read()
+ * if (!chars_in_buffer(tty))->false
+ * copy_from_read_buf()
+ * read_tail=commit_head
+ * n_tty_kick_worker()
+ * if (ldata->no_room)->false
+ * ldata->no_room = 1
+ * Then both kworker and reader will fail to kick n_tty_kick_worker(),
+ * smp_mb is paired with smp_mb() in n_tty_read().
I would only let it ^^^ documented in the commit log as you did.
+ */
+ smp_mb();
+ if (!chars_in_buffer(tty))
+ n_tty_kick_worker(tty);
+ }
+
up_read(&tty->termios_rwsem);
return rcvd;
@@ -2282,8 +2303,25 @@ static ssize_t n_tty_read(struct tty_struct *tty, struct file *file,
if (time)
timeout = time;
}
- if (old_tail != ldata->read_tail)
+ if (old_tail != ldata->read_tail) {
+ /*
+ * Make sure no_room is not read in n_tty_kick_worker()
+ * before setting ldata->read_tail in copy_from_read_buf(),
The same here (it's only repeated). I think the above two lines are
enough for the comment. We have git blame after all.
+ * otherwise, following race may occur:
+ * n_tty_read()
+ * n_tty_receive_buf_common()
+ * n_tty_kick_worker()
+ * if(ldata->no_room)->false
+ * ldata->no_room = 1
+ * if (!chars_in_buffer(tty))->false
+ * copy_from_read_buf()
+ * read_tail=commit_head
+ * Both reader and kworker will fail to kick tty_buffer_restart_work(),
+ * smp_mb is paired with smp_mb() in n_tty_receive_buf_common().
+ */
+ smp_mb();
n_tty_kick_worker(tty);
+ }
up_read(&tty->termios_rwsem);
remove_wait_queue(&tty->read_wait, &wait);
--
js