On Wed, Jun 15, 2022 at 10:57:48AM +0300, Ilpo Järvinen wrote: > Hi Greg, > > On Wed, 15 Jun 2022, Greg KH wrote: > > > On Wed, Jun 15, 2022 at 11:45:10AM +0800, cael wrote: > > > We have met a hang on pty device, the reader was blocking > > > at epoll on master side, the writer was sleeping at wait_woken > > > inside n_tty_write on slave side, and the write buffer on > > > tty_port was full, we found that the reader and writer would > > > never be woken again and blocked forever. > > > > > > The problem was caused by a race between reader and kworker: > > > n_tty_read(reader): n_tty_receive_buf_common(kworker): > > > |room = N_TTY_BUF_SIZE - (ldata->read_head - tail) > > > |room <= 0 > > > copy_from_read_buf()| > > > n_tty_kick_worker() | > > > |ldata->no_room = true > > > > > > After writing to slave device, writer wakes up kworker to flush > > > data on tty_port to reader, and the kworker finds that reader > > > has no room to store data so room <= 0 is met. At this moment, > > > reader consumes all the data on reader buffer and calls > > > n_tty_kick_worker to check ldata->no_room which is false and > > > reader quits reading. Then kworker sets ldata->no_room=true > > > and quits too. > > > > > > If write buffer is not full, writer will wake kworker to flush data > > > again after following writes, but if write buffer is full and writer > > > goes to sleep, kworker will never be woken again and tty device is > > > blocked. > > > > > > This problem can be solved with a check for read buffer size inside > > > n_tty_receive_buf_common, if read buffer is empty and ldata->no_room > > > is true, a call to n_tty_kick_worker is necessary to keep flushing > > > data to reader. > > > > > > Signed-off-by: cael <juanfengpy@xxxxxxxxx> > > > --- > > > Patch changelogs between v1 and v2: > > > -add barrier inside n_tty_read and n_tty_receive_buf_common; > > > -comment why barrier is needed; > > > -access to ldata->no_room is changed with READ_ONCE and WRITE_ONCE; > > > Patch changelogs between v2 and v3: > > > -in function n_tty_receive_buf_common, add unlikely to check > > > ldata->no_room, eg: if (unlikely(ldata->no_room)), and READ_ONCE > > > is removed here to get locality; > > > -change comment for barrier to show the race condition to make > > > comment easier to understand; > > > Patch changelogs between v3 and v4: > > > -change subject from 'tty: fix a possible hang on tty device' to > > > 'tty: fix hang on tty device with no_room set' to make subject > > > more obvious. > > > > This is the friendly patch-bot of Greg Kroah-Hartman. You have sent him > > a patch that has triggered this response. He used to manually respond > > to these common problems, but in order to save his sanity (he kept > > writing the same thing over and over, yet to different people), I was > > created. Hopefully you will not take offence and will fix the problem > > in your patch and resubmit it so that it can be accepted into the Linux > > kernel tree. > > > > You are receiving this message because of the following common error(s) > > as indicated below: > > [...snip...] > > > - This looks like a new version of a previously submitted patch, but you > > did not list below the --- line any changes from the previous version. > > Please read the section entitled "The canonical patch format" in the > > kernel file, Documentation/SubmittingPatches for what needs to be done > > here to properly describe this. > > I think your bot's changelog heuristic got it wrong here. He provided > the list of changes as you can see above. Ah, yeah, it didn't catch the changelog text at all, will go fix that up... > (The name thing might still be valid though, I've no idea which names are > real and which are not). Yes, for the name reason alone we can't take this change :( thanks, greg k-h