On 3/7/19 2:37 PM, Josef Bacik wrote:
> We discovered a problem in newer kernels where a disconnect of a NBD
> device while the flush request was pending would result in a hang. This
> is because the blk mq timeout handler does
>
>     if (!refcount_inc_not_zero(&rq->ref))
>         return true;
>
> to determine if it's ok to run the timeout handler for the request.
> Flush_rq's don't have a ref count set, so we'd skip running the timeout
> handler for this request and it would just sit there in limbo forever.
>
> Fix this by always setting the refcount of any request going through
> blk_init_rq() to 1. I tested this with a nbd-server that dropped flush
> requests to verify that it hung, and then tested with this patch to
> verify I got the timeout as expected and the error handling kicked in.
> Thanks,

Looks good to me, thanks Josef.

--
Jens Axboe
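
For readers following along, the failure mode described in the quoted text can be sketched in userspace. This is only a rough model under stated assumptions, not the kernel code: `model_refcount`, `model_request`, `model_refcount_inc_not_zero()` and `model_timeout_would_run()` are hypothetical names built on C11 atomics, and the model returns false where the quoted handler bails out early. It is meant to show why a request whose refcount was never set gets skipped, and why setting it to 1 at init time avoids that.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's refcount type. */
struct model_refcount {
    atomic_int refs;
};

/* Hypothetical request carrying only the field relevant here. */
struct model_request {
    struct model_refcount ref;
};

/* Take a reference only if the count is currently nonzero. */
static bool model_refcount_inc_not_zero(struct model_refcount *r)
{
    int old = atomic_load(&r->refs);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Model of the check quoted above: returns true if the timeout
 * handling would run for this request, false if it is skipped.
 */
static bool model_timeout_would_run(struct model_request *rq)
{
    if (!model_refcount_inc_not_zero(&rq->ref))
        return false; /* skipped, like a request whose ref was never set */

    /* A real handler would check the deadline here. */

    atomic_fetch_sub(&rq->ref.refs, 1); /* drop the temporary reference */
    return true;
}
```

In this model, a request whose count is left at 0 is never examined by the timeout check, while one initialized to 1 is examined and ends with its count back at 1.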