Re: [PATCH] blk-mq: fix corruption with direct issue

On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > 
> > But at that time, there isn't io scheduler for MQ, so in theory the
> > issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> > add blk-mq adaptation of the deadline IO scheduler").
> 
> Hi Ming,
> 
> How serious were you about this being (theoretically) an
> issue since 4.11?  Can you talk about how it might get triggered, and
> how we can test for it?  The reason why I ask is because we're trying
> to track down a mysterious file system corruption problem on a 4.14.x
> stable kernel.  The symptoms are *very* eerily similar to kernel
> bugzilla #201685.

Hi Theodore,

It is just a theoretical analysis.

blk_mq_try_issue_directly() is called in two branches of blk_mq_make_request(),
both of which are taken only on real MQ disks.

IO merge can be done with either the none scheduler or a real IO scheduler,
so in theory the risk has been there since v4.1, but IO merge on the sw
queue didn't work for quite a long time, until it was fixed by
ab42f35d9cb5ac49b5a2.

As Jens mentioned in bugzilla, there are several conditions required for
triggering the issue (a quick sysfs check for these is sketched after the
list):

- MQ device

- queue busy can be triggered. It is hard to trigger on NVMe PCI,
  but may be possible on NVMe FC. However, it can be quite easy to
  trigger on SCSI devices. We know there are some MQ SCSI HBAs,
  such as qlogic FC and megaraid_sas.

- IO merge is enabled. 
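
In case it is useful, here is a rough way to check those conditions on a
given disk via sysfs (the disk name sdc is only an example):

# one directory per hw queue; the mq/ directory only exists for blk-mq disks
ls /sys/block/sdc/mq/

# current IO scheduler; merging can happen with none or with a real scheduler
cat /sys/block/sdc/queue/scheduler

# 0 means IO merge is enabled, non-zero means merging is (partly) disabled
cat /sys/block/sdc/queue/nomerges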

I have set up scsi_debug in the following way:

modprobe scsi_debug dev_size_mb=4096 clustering=1 \
		max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 makes this disk show up as an MQ device
- max_queue=2 may trigger the queue busy condition easily

and ran some write IO on ext4 over the disk (fio, kernel building, ...) for
some time, but still couldn't trigger the data corruption even once.
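
For anyone who wants to try the same thing, a simple fio job along these
lines should do; the mount point and job parameters below are only examples,
not what matters for the reproducer:

# /mnt/test is ext4 mounted on the scsi_debug disk; adjust paths/sizes
fio --name=writetest --directory=/mnt/test --rw=randwrite \
	--bs=4k --size=512M --numjobs=4 --ioengine=libaio --iodepth=16 \
	--verify=crc32c --group_reporting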

I should have created more LUNs, so that the queue becomes busy more
easily; will do that soon.
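
Something like the following should do it (max_luns=4 is just an arbitrary
bump, otherwise the same parameters as above):

modprobe scsi_debug dev_size_mb=4096 clustering=1 \
		max_luns=4 submit_queues=2 max_queue=2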

> 
> The problem is that it is super-rare --- roughly once a week
> out of a population of about 2500 systems.  The workload is NFS
> serving.  Unfortunately, the problem is that since 4.14.63, we can no
> longer disable blk-mq for the virtio-scsi driver, thanks to the commit
> b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
> vector affinity") getting backported into 4.14.63 as commit
> 70b522f163bbb32.

virtio_scsi supports multi-queue mode; if that is enabled in your
setup, you may try single-queue mode and see if it makes a difference.

If multi-queue mode isn't enabled, your problem should be different from
this one. I remember multi-queue mode isn't enabled on virtio-scsi in GCE.
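
A quick way to see how many hw queues the virtio-scsi disk actually got
(sda is only an example, and the mq/ directory is only there when blk-mq
is in use):

ls -d /sys/block/sda/mq/* | wc -l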

> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there anything else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology, so
it should be fine to revert b5b6e8c8d3b4 on a normal VM.
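
If you go that route, on top of your 4.14 tree it should be a matter of
something like the following (assuming the backport reverts cleanly):

git revert 70b522f163bbb32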

No other suggestions yet.

Thanks,
Ming


