On Wed, 1 Jun 2022, cael wrote: > From: cael <juanfengpy@xxxxxxxxx> > Subject: [PATCH v2] tty: fix a possible hang on tty device > > We have met a hang on pty device, the reader was blocking > at epoll on master side, the writer was sleeping at wait_woken > inside n_tty_write on slave side, and the write buffer on > tty_port was full, we found that the reader and writer would > never be woken again and block forever. > > The problem was caused by a race between reader and kworker: > n_tty_read(reader): n_tty_receive_buf_common(kworker): > |room = N_TTY_BUF_SIZE - (ldata->read_head - tail) > |room <= 0 > copy_from_read_buf()| > n_tty_kick_worker() | > |ldata->no_room = true > > After writing to slave device, writer wakes up kworker to flush > data on tty_port to reader, and the kworker finds that reader > has no room to store data so room <= 0 is met. At this moment, > reader consumes all the data on reader buffer and call > n_tty_kick_worker to check ldata->no_room which is false and > reader quits reading. Then kworker sets ldata->no_room=true > and quits too. > > If write buffer is not full, writer will wake kworker to flush data > again after following writes, but if writer buffer is full and writer > goes to sleep, kworker will never be woken again and tty device is > blocked. > > This problem can be solved with a check for read buffer size inside > n_tty_receive_buf_common, if read buffer is empty and ldata->no_room > is true, a call to n_tty_kick_worker is necessary to keep flushing > data to reader. > > Signed-off-by: cael <juanfengpy@xxxxxxxxx> > Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@xxxxxxxxxxxxxxx> You should not add Reviewed-by on your own. Only after the person himself/herself gives that tag for you, include it. > > --- > diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c > index efc72104c840..21241ea7cdb9 100644 > --- a/drivers/tty/n_tty.c > +++ b/drivers/tty/n_tty.c > @@ -201,8 +201,8 @@ static void n_tty_kick_worker(struct tty_struct *tty) > struct n_tty_data *ldata = tty->disc_data; > > /* Did the input worker stop? Restart it */ > - if (unlikely(ldata->no_room)) { > - ldata->no_room = 0; > + if (unlikely(READ_ONCE(ldata->no_room))) { > + WRITE_ONCE(ldata->no_room, 0); > > WARN_RATELIMIT(tty->port->itty == NULL, > "scheduling with invalid itty\n"); > @@ -1632,7 +1632,7 @@ n_tty_receive_buf_common(struct tty_struct *tty, > const unsigned char *cp, > if (overflow && room < 0) > ldata->read_head--; > room = overflow; > - ldata->no_room = flow && !room; > + WRITE_ONCE(ldata->no_room, flow && !room); > } else > overflow = 0; > > @@ -1663,6 +1663,21 @@ n_tty_receive_buf_common(struct tty_struct > *tty, const unsigned char *cp, > } else > n_tty_check_throttle(tty); > > + if (READ_ONCE(ldata->no_room)) { Hmm, since this function is only one setting it to non-zero value, perhaps the information could be carried over here in a no_room local var (and maybe unlikely() would be useful too similar to n_tty_kick_worker). After all, this check is just an optimization for the common case where we know no_room is definitely zero. > + /* > + * Reader ensures that read_tail is updated before > checking no_room, > + * make sure that no_room is set before reading read_tail here. > + * Now no_room is visible by reader, the race needs to > be handled is > + * that reader has passed checkpoint for no_room and > reader buffer > + * is empty, if so n_tty_kick_worker will not be > called by reader, > + * instead, this function is called here. This part is hard to parse/understand. Please try to rephrase. -- i.