On Tue, Jan 4, 2022 at 12:10 AM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
>
> On Thu, Dec 30, 2021 at 12:01:23PM +0800, Yongji Xie wrote:
> > On Thu, Dec 30, 2021 at 1:35 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 27, 2021 at 05:12:41PM +0800, Xie Yongji wrote:
> > > > The rescuer thread might take over the works queued on
> > > > the workqueue when worker thread creation times out.
> > > > If this happens, we have no chance to create multiple
> > > > recv threads, which causes an I/O hang on this nbd device.
> > >
> > > If a workqueue is used there aren't really 'receive threads'.
> > > What is the deadlock here?
> >
> > We might have multiple recv works, and those recv works won't quit
> > unless the socket is closed. If the rescuer thread takes over those
> > works, only the first recv work can run. The I/O that needs to be
> > handled by the other recv works hangs because no thread is left to
> > handle it.
>
> I'm not following this explanation. What is the rescuer thread you're
> talking about?

The one documented here:
https://www.kernel.org/doc/html/latest/core-api/workqueue.html#c.rescuer_thread

> If there's an error we close the socket which will error out the
> recvmsg which will make the recv workqueue close down.

When would the socket get closed? The nbd daemon doesn't know what is
happening inside the kernel.

> > In that case, we can see below stacks in the rescuer thread:
> >
> > __schedule
> > schedule
> > schedule_timeout
> > unix_stream_read_generic
> > unix_stream_recvmsg
> > sock_xmit
> > nbd_read_stat
> > recv_work
> > process_one_work
> > rescuer_thread
> > kthread
> > ret_from_fork
>
> This is just the thing hanging waiting for an incoming request, so this
> doesn't tell me anything. Thanks,

The point is that *recv_work* is being handled by the *rescuer_thread*.
Normally it should be handled by a *worker_thread*, like:

__schedule
schedule
schedule_timeout
unix_stream_read_generic
unix_stream_recvmsg
sock_xmit
nbd_read_stat
recv_work
process_one_work
*worker_thread*
kthread
ret_from_fork

Thanks,
Yongji
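
P.S.: To make the failure mode easier to see outside of nbd, below is a
minimal sketch of the problem. This is a hypothetical demo module, not
nbd code; all the demo_* names are made up. It just queues two
never-returning works on a WQ_MEM_RECLAIM workqueue, which, as far as I
can tell, is the flag that gives nbd's recv workqueue its rescuer in the
first place:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static struct workqueue_struct *demo_wq;
static struct work_struct demo_work[2];
static struct completion never_done;

static void demo_work_fn(struct work_struct *work)
{
	/*
	 * Stand-in for nbd's recv_work(): blocks the way recvmsg()
	 * does until the socket is closed, i.e. potentially forever.
	 */
	wait_for_completion(&never_done);
}

static int __init demo_init(void)
{
	int i;

	init_completion(&never_done);

	/* WQ_MEM_RECLAIM is what creates the rescuer thread. */
	demo_wq = alloc_workqueue("demo", WQ_MEM_RECLAIM, 0);
	if (!demo_wq)
		return -ENOMEM;

	for (i = 0; i < 2; i++) {
		INIT_WORK(&demo_work[i], demo_work_fn);
		queue_work(demo_wq, &demo_work[i]);
	}

	/*
	 * Under memory pressure, if the worker pool fails to create a
	 * new worker in time, the rescuer takes over the queued works
	 * and runs them strictly one after another.  It enters
	 * demo_work_fn() for demo_work[0] and never returns, so
	 * demo_work[1] is never executed -- the same shape as the
	 * hung recv path above.
	 */
	return 0;
}
module_init(demo_init);

MODULE_LICENSE("GPL");

The key property is that the rescuer processes work items serially, so
any work that blocks indefinitely starves everything queued behind it.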