On 08/13/2019 10:45 AM, Mike Christie wrote:
> On 08/13/2019 08:13 AM, Josef Bacik wrote:
>> On Fri, Aug 09, 2019 at 04:26:10PM -0500, Mike Christie wrote:
>>> This fixes a regression added in 4.9 with commit:
>>>
>>> commit 0eadf37afc2500e1162c9040ec26a705b9af8d47
>>> Author: Josef Bacik <jbacik@xxxxxx>
>>> Date:   Thu Sep 8 12:33:40 2016 -0700
>>>
>>>     nbd: allow block mq to deal with timeouts
>>>
>>> where before the patch userspace would set the timeout to 0 to disable
>>> it. With the above patch, a zero timeout tells the block layer to use
>>> the default value of 30 seconds. For setups where commands can take a
>>> long time or experience transient issues like network disruptions, this
>>> then results in IO errors being sent to the application.
>>>
>>> To fix this, the patch still uses the common block layer timeout
>>> framework, but if zero is set, nbd just logs a message and then resets
>>> the timer when it expires.
>>>
>>> Josef,
>>>
>>> I did not cc stable, but I think we want to port the patches to some
>>> releases. We originally hit this with users using the longterm kernels
>>> with ceph. The patch does not apply cleanly anywhere with older ones
>>> like 4.9, so I was not sure how we wanted to handle it.
>>>
>>
>> I assume you tested this? IIRC there was a problem where 0 really meant 0 and
>
> Yes.
>
>> commands would insta-timeout. But my memory is foggy here, so I'm not sure if
>> it was setting the tag_set timeout to 0 that made things go wrong, or what. Or
>> I could be making it all up, who knows.
>
> Yes, if you call blk_queue_rq_timeout with 0, then the command will
> time out almost immediately. I added a check for this in the first patch.
>
> If blk_mq_tag_set.timeout is 0, blk_mq_init_allocated_queue uses the
> default 30 second value.
>
> So with the patch, if the user sets the timeout to 0, then we will just
> log a message every 30 seconds that the command is stuck.
>
>>
>> There's a blktest that just runs fio on a normal device with no timeouts or
>> anything, that's where I would see the problem since it was a little racy.
>> Basically have the timeout set to 0 and put load on the disk and eventually
>> you'd start seeing timeouts. If that all goes fine then you can add

Oh yeah, just to be clear, that is another issue that you can hit with any
driver. If an app/user sets the timeout value in sysfs:

/sys/block/$dev/queue/io_timeout

then it bypasses the driver completely, because it just does
queue_io_timeout_store -> blk_queue_rq_timeout, and that function/interface
lets you set the timeout to anything.

My patches just fix up the nbd interface that existing tools were using and
hitting regressions with.

I was debating about sending a patch for not allowing blk_queue_rq_timeout(q, 0)
in a separate patchset, but I was not sure if people use that for testing fast
timeouts.
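For reference, the sysfs path being discussed looks roughly like this (nbd0 is just an example device name; the attribute exists for any blk-mq request queue):

```shell
# Read the current per-request timeout, in milliseconds:
cat /sys/block/nbd0/queue/io_timeout

# Writing goes queue_io_timeout_store -> blk_queue_rq_timeout directly,
# so the new value takes effect without the driver being consulted:
echo 100 > /sys/block/nbd0/queue/io_timeout
```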