On 12/6/18 7:46 PM, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
>>
>> But at that time, there isn't io scheduler for MQ, so in theory the
>> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
>> add blk-mq adaptation of the deadline IO scheduler").
>
> Hi Ming,
>
> How serious were you about this being (theoretically) an issue
> since 4.11?  Can you talk about how it might get triggered, and
> how we can test for it?  The reason why I ask is because we're trying
> to track down a mysterious file system corruption problem on a 4.14.x
> stable kernel.  The symptoms are *very* eerily similar to kernel
> bugzilla #201685.
>
> The problem is that the problem is super-rare --- roughly once a week
> out of a population of about 2500 systems.  The workload is NFS
> serving.  Unfortunately, the problem is since 4.14.63, we can no
> longer disable blk-mq for the virtio-scsi driver, thanks to the commit
> b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
> vector affinity") getting backported into 4.14.63 as commit
> 70b522f163bbb32.
>
> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there anything else
> you might suggest?

We should just make SCSI do the right thing, which is to unprep if it
sees BUSY and prep next time again. Otherwise I fear the direct dispatch
isn't going to be super useful, if a failed direct dispatch prevents
future merging.

This would be a lot less error prone as well for other cases.

-- 
Jens Axboe