RE: virtio-blk: support completion batching for the IRQ path - failure

I assume your find command should all be on one line,
i.e.
find . -type f -exec grep -aH . {} \;

Being a typical write-only Linux command, I have no idea what it translates to...

However, /sys/kernel/debug/block/vda (and /vdb ... /vdp) are all empty directories, and the find command returns nothing.
What are you hoping to find in these paths, and what is supposed to populate them?

I have no VMs running; the block devices are implemented in hardware, not in QEMU, and are attached to the host as PFs.
The host has 160 logical CPUs (dual socket, hyperthreaded), i.e. 40 physical cores per socket, and 256GB RAM in total.
fio is running on the host.
The fio threads, kblockd worker threads and IRQs seem to get scattered across all cores, on both sockets, which doesn't seem a very efficient approach, but may not be a factor.
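
For reference, a rough sketch of how the IRQ placement could be inspected. It assumes the virtio interrupts show up in /proc/interrupts with "virtio" in their name and that /proc/irq/<n>/smp_affinity_list is readable; show_virtio_irq_affinity is a hypothetical helper name, and the optional argument exists only so it can be pointed at a copy of the files.

```shell
# Hypothetical helper: list each virtio interrupt and the CPUs it may run on.
# $1 = proc root (defaults to /proc).
show_virtio_irq_affinity() {
    proc="${1:-/proc}"
    # First column of /proc/interrupts is "<irq>:", last column is the name.
    awk '/virtio/ { sub(":", "", $1); print $1, $NF }' "$proc/interrupts" |
    while read -r irq name; do
        aff="$proc/irq/$irq/smp_affinity_list"
        if [ -r "$aff" ]; then
            echo "irq $irq ($name): cpus $(cat "$aff")"
        fi
    done
}
```

Run as `show_virtio_irq_affinity` on the host; this only shows where each IRQ is allowed to run, not where fio or kblockd end up.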


As previously indicated, I have added counters in the IRQ routine, which show that all completions are accounted for inside the virtio-blk driver.
However, the driver then hands the requests back to the block stack (presumably via blk_mq_add_to_batch()), which presumably then fails to process all the completions properly, leaving fio in the lurch. The virtio_mq_ops.complete callback in the virtio-blk driver (virtblk_request_done()) _never_ gets called.
With the earlier code (5.15), it gets called for every request.
With rq_affinity=2 (6.3.3), it is called for most (~95%) requests, but the system still hangs, albeit far less frequently; perhaps only on one of the requests that doesn't trigger the .complete callback?
i.e. it all points to a failure in the batching mechanism.
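
For anyone repeating the rq_affinity test, here is a sketch of applying the setting to all the vd* devices in one go. As I understand the queue sysfs documentation, 1 completes a request on the CPU group that submitted it, and 2 forces completion on the submitting CPU itself. set_rq_affinity is a hypothetical helper name; the optional second argument (the sysfs block root) is only there so it can be tried against a scratch directory.

```shell
# Hypothetical helper: write an rq_affinity value for every vd* device.
# $1 = value (0, 1 or 2), $2 = sysfs block root (defaults to /sys/block).
set_rq_affinity() {
    val="$1"
    root="${2:-/sys/block}"
    for f in "$root"/vd*/queue/rq_affinity; do
        [ -e "$f" ] || continue        # glob matched nothing
        echo "$val" > "$f"
        echo "$f: $(cat "$f")"         # read back to confirm the setting
    done
}
```

Run as root, e.g. `set_rq_affinity 2`.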

Martin


-----Original Message-----
From: Suwan Kim <suwan.kim027@xxxxxxxxx> 
Sent: Monday, June 12, 2023 4:05 PM
To: Roberts, Martin <martin.roberts@xxxxxxxxx>
Cc: mst@xxxxxxxxxx; virtualization <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx
Subject: Re: virtio-blk: support completion batching for the IRQ path - failure

Hi Martin,

I'm trying to reproduce the issue, but on my machine the IO hang doesn't happen.
I attached up to 16 disk images to a VM and tried various vCPU counts and
memory sizes (3~16 vCPUs, 2~8GB RAM).
Could you let me know your VM settings?

And in order to know whether the IO hang is triggered by the driver,
could you please share a log from when the IO hang happens?
You can get a log for each /dev/vd* with the command below:

cd /sys/kernel/debug/block/[test_device] && find . -type f -exec grep -aH . {} \;
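
Spelled out, the one-liner walks the device's debugfs directory and prints every non-empty line of every file, prefixed with the file name. A minimal sketch of the same thing as a function, assuming debugfs is mounted at /sys/kernel/debug and the command is run as root; dump_blk_debugfs is a hypothetical name:

```shell
# Hypothetical helper: dump every non-empty line of every regular file
# under a directory, each prefixed with "<file>:".
dump_blk_debugfs() {
    dir="$1"
    # find -type f : regular files only
    # grep -a      : treat binary content as text
    # grep -H      : always print the file name
    # pattern '.'  : any line containing at least one character
    (cd "$dir" && find . -type f -exec grep -aH . {} \;)
}
```

For example: `dump_blk_debugfs /sys/kernel/debug/block/vda > vda.log`, run while the IO is hung.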

Regards,
Suwan Kim


On Thu, Jun 8, 2023 at 7:16 PM Roberts, Martin <martin.roberts@xxxxxxxxx> wrote:
>
> The rq_affinity change does not resolve the issue; just reduces its occurrence rate; I am still seeing hangs with it set to 2.
>
> Martin
>
>
>
> From: Roberts, Martin
> Sent: Wednesday, June 7, 2023 3:46 PM
> To: Suwan Kim <suwan.kim027@xxxxxxxxx>
> Cc: mst@xxxxxxxxxx; virtualization <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx
> Subject: RE: virtio-blk: support completion batching for the IRQ path - failure
>
>
>
> It is the change indicated that breaks it - changing the IRQ handling to batching.
>
>
>
>
>
>
>
> From reports such as,
>
> [PATCH 1/1] blk-mq: added case for cpu offline during send_ipi in rq_complete (kernel.org)
>
> [RFC] blk-mq: Don't IPI requests on PREEMPT_RT - Patchwork (linaro.org)
>
>
>
> I’m thinking the issue has something to do with which CPU the IRQ is running on.
>
>
>
> So, I set,
>
> # echo 2 > /sys/block/vda/queue/rq_affinity
> # echo 2 > /sys/block/vdb/queue/rq_affinity
> …
> # echo 2 > /sys/block/vdp/queue/rq_affinity
>
>
>
>
>
> and the system (running 16 disks, 4 queues/disk) has not yet hung (running OK for several hours)…
>
>
>
> Martin
>
>
>
> -----Original Message-----
> From: Suwan Kim <suwan.kim027@xxxxxxxxx>
> Sent: Wednesday, June 7, 2023 3:21 PM
> To: Roberts, Martin <martin.roberts@xxxxxxxxx>
> Cc: mst@xxxxxxxxxx; virtualization <virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx
> Subject: Re: virtio-blk: support completion batching for the IRQ path - failure
>
>
>
> On Wed, Jun 7, 2023 at 6:14 PM Roberts, Martin <martin.roberts@xxxxxxxxx> wrote:
>
> >
>
> > Re: virtio-blk: support completion batching for the IRQ path · torvalds/linux@07b679f · GitHub
>
> >
>
> > Signed-off-by: Suwan Kim suwan.kim027@xxxxxxxxx
>
> >
>
> > Signed-off-by: Michael S. Tsirkin mst@xxxxxxxxxx
>
> >
>
> >
>
> >
>
> >
>
> >
>
> > This change appears to have broken things…
>
> >
>
> > We now see applications hanging during disk accesses.
>
> >
>
> > e.g.
>
> >
>
> > multi-port virtio-blk device running in h/w (FPGA)
>
> >
>
> > Host running a simple ‘fio‘ test.
>
> >
>
> > [global]
> > thread=1
> > direct=1
> > ioengine=libaio
> > norandommap=1
> > group_reporting=1
> > bs=4K
> > rw=read
> > iodepth=128
> > runtime=1
> > numjobs=4
> > time_based
> >
> > [job0]
> > filename=/dev/vda
> >
> > [job1]
> > filename=/dev/vdb
> >
> > [job2]
> > filename=/dev/vdc
> >
> > ...
> >
> > [job15]
> > filename=/dev/vdp
>
> >
>
> >
>
> >
>
> > i.e. 16 disks; 4 queues per disk; simple burst of 4KB reads
>
> >
>
> > This is repeatedly run in a loop.
>
> >
>
> >
>
> >
>
> > After a few seconds (normally <10), fio hangs.
> >
> > With 64 queues (16 disks), failure occurs within a few seconds; with 8 queues (2 disks) it may take ~1 hour before hanging.
>
> >
>
> > Last message:
>
> >
>
> > fio-3.19
>
> >
>
> > Starting 8 threads
>
> >
>
> > Jobs: 1 (f=1): [_(7),R(1)][68.3%][eta 03h:11m:06s]
>
> >
>
> > I think this means at the end of the run 1 queue was left incomplete.
>
> >
>
> >
>
> >
>
> > ‘diskstats’ (run while fio is hung) shows no outstanding transactions.
>
> >
>
> > e.g.
>
> >
>
> > $ cat /proc/diskstats
> > ...
> > 252       0 vda 1843140071 0 14745120568 712568645 0 0 0 0 0 3117947 712568645 0 0 0 0 0 0
> > 252      16 vdb 1816291511 0 14530332088 704905623 0 0 0 0 0 3117711 704905623 0 0 0 0 0 0
> > ...
>
> >
>
> >
>
> >
>
> > Other stats, both in the h/w and added to the virtio-blk driver ([a] virtio_queue_rq(), [b] virtblk_handle_req(), [c] virtblk_request_done()), all agree: every request had a completion, and virtblk_request_done() never gets called.
>
> >
>
> > e.g.
>
> >
>
> > PF= 0                         vq=0           1           2           3
> > [a]request_count     -   839416590   813148916   105586179    84988123
> > [b]completion1_count -   839416590   813148916   105586179    84988123
> > [c]completion2_count -           0           0           0           0
> >
> > PF= 1                         vq=0           1           2           3
> > [a]request_count     -   823335887   812516140   104582672    75856549
> > [b]completion1_count -   823335887   812516140   104582672    75856549
> > [c]completion2_count -           0           0           0           0
>
> >
>
> >
>
> >
>
> > i.e. the issue is after the virtio-blk driver.
>
> >
>
> >
>
> >
>
> >
>
> >
>
> > This change was introduced in kernel 6.3.0.
>
> >
>
> > I am seeing this using 6.3.3.
>
> >
>
> > If I run with an earlier kernel (5.15), it does not occur.
>
> >
>
> > If I make a simple patch to the 6.3.3 virtio-blk driver, to skip the blk_mq_add_to_batch() call, it does not fail.
>
> >
>
> > e.g.
>
> >
>
> > kernel 5.15 – this is OK
>
> >
>
> > virtio_blk.c,virtblk_done() [irq handler]
>
> >
>
> >         if (likely(!blk_should_fake_timeout(req->q))) {
> >                 blk_mq_complete_request(req);
> >         }
>
> >
>
> >
>
> >
>
> > kernel 6.3.3 – this fails
>
> >
>
> > virtio_blk.c,virtblk_handle_req() [irq handler]
>
> >
>
> >         if (likely(!blk_should_fake_timeout(req->q))) {
> >                 if (!blk_mq_complete_request_remote(req)) {
> >                         if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
> >                                 virtblk_request_done(req);    // this never gets called... so blk_mq_add_to_batch() must always succeed
> >                         }
> >                 }
> >         }
>
> >
>
> >
>
> >
>
> > If I do, kernel 6.3.3 – this is OK
>
> >
>
> > virtio_blk.c,virtblk_handle_req() [irq handler]
>
> >
>
> >         if (likely(!blk_should_fake_timeout(req->q))) {
> >                 if (!blk_mq_complete_request_remote(req)) {
> >                         virtblk_request_done(req);    // force this here...
> >                         if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
> >                                 virtblk_request_done(req);    // this never gets called... so blk_mq_add_to_batch() must always succeed
> >                         }
> >                 }
> >         }
>
> >
>
> >
>
> >
>
> >
>
> >
>
> > Perhaps you might like to fix/test/revert this change…
>
> >
>
> > Martin
>
> >
>
> >
>
>
>
> Hi Martin,
>
>
>
> There are many changes between 6.3.0 and 6.3.3.
>
> Could you try to find a commit which triggers the io hang?
>
> Is it ok with 6.3.0 kernel or with reverting
>
> "virtio-blk: support completion batching for the IRQ path" commit?
>
>
>
> We need to confirm which commit is causing the error.
>
>
>
> Regards,
>
> Suwan Kim



