On Fri, 2012-11-16 at 01:55 -0800, Nicholas A. Bellinger wrote:
> On Thu, 2012-11-15 at 23:59 -0800, Nicholas A. Bellinger wrote:
> > On Thu, 2012-11-15 at 15:50 -0800, Nicholas A. Bellinger wrote:
> > > On Thu, 2012-11-15 at 21:26 +0000, Prantis, Kelsey wrote:
> > > > On 11/13/12 2:22 PM, "Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > <SNIP>
> > > > >
> > > > Hi Nicholas,
> > > >
> > > > Sorry for the delay. The new debug output with your latest patch (and typo
> > > > adjustment) is up at ftp://ftp.whamcloud.com/uploads/lio-debug-4.txt.bz2
> > > >
> > > Hi Kelsey,

<SNIP>

> One more update on this bug..
>
> So after increasing max_sectors_kb for virtio-blk w/ IBLOCK in the KVM
> guest from the default 512 to 2048 with:
>
>    echo 2048 > /sys/block/vda/queue/max_sectors_kb
>
> as well as bumping the hw/virtio-blk.c seg_max default in the qemu code
> from 126 to 256, so that the guest's vdX struct block_device
> automatically registers with max_segments=256 by default:
>
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index 6f6d172..c929b6b 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -485,7 +485,7 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>      bdrv_get_geometry(s->bs, &capacity);
>      memset(&blkcfg, 0, sizeof(blkcfg));
>      stq_raw(&blkcfg.capacity, capacity);
> -    stl_raw(&blkcfg.seg_max, 128 - 2);
> +    stl_raw(&blkcfg.seg_max, 258 - 2);
>      stw_raw(&blkcfg.cylinders, s->conf->cyls);
>      stl_raw(&blkcfg.blk_size, blk_size);
>      stw_raw(&blkcfg.min_io_size, s->conf->min_io_size / blk_size);
>
> These two changes seem to provide a working v3.6-rc6 guest
> virtio-blk+IBLOCK setup that is (so far) passing fio write-verify
> against a local tcm_loop LUN..
>
> After changing max_sectors_kb 512 -> 2048 for virtio-blk, the avgrq-sz
> ratio between the virtio-blk and tcm_loop block devices is now 4K vs.
> 16K (previously 1K vs. 16K), which likely means a stack overflow
> somewhere in the virtio-blk -> virtio code while processing a large
> (8 MB) struct request generated by a SCSI initiator port.
>
> Not sure just yet if the qemu virtio-blk max_segments=126 -> 256 change
> is necessary for the work-around, but it might be worthwhile if you
> have a qemu build environment set up. Will try a bit more with
> max_segments=126 later today.
>

So during sustained testing overnight with the above changes, I still
managed to trigger another OOPs. However, this time it appears to be
pointing at virtio-blk.c code..
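(Side note: the guest-side counterpart of the seg_max change quoted above
lives in virtblk_probe(). Roughly, the v3.6-era drivers/block/virtio_blk.c
logic looks like the sketch below -- paraphrased from memory, not verbatim
source -- which is why a seg_max of 256 from qemu shows up as
max_segments=256 on the guest vdX request queue:)

	u32 sg_elems;
	int err;

	/* Read seg_max from virtio-blk config space; fall back to a
	 * single segment if VIRTIO_BLK_F_SEG_MAX is not offered. */
	err = virtio_config_val(vdev, VIRTIO_BLK_F_SEG_MAX,
				offsetof(struct virtio_blk_config, seg_max),
				&sg_elems);
	if (err || !sg_elems)
		sg_elems = 1;

	/* Two extra sg elements are reserved at head and tail for the
	 * request header and status byte. */
	sg_elems += 2;

	/* ... */

	/* Export the usable count to the block layer, so qemu's
	 * seg_max of 256 becomes max_segments=256 on the guest queue. */
	blk_queue_max_segments(q, sg_elems - 2);

Anyway, here's the OOPs from the overnight run: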
[ 7728.484801] BUG: unable to handle kernel paging request at 0000000200000045
[ 7728.485713] IP: [<ffffffffa006717b>] blk_done+0x51/0xf1 [virtio_blk]
[ 7728.485713] PGD 0
[ 7728.485713] Oops: 0000 [#1] SMP
[ 7728.485713] Modules linked in: ib_srpt ib_cm ib_sa ib_mad ib_core tcm_qla2xxx qla2xxx tcm_loop tcm_fc libfc scsi_transport_fc iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 e1000 virtio_blk sd_mod sr_mod cdrom virtio_pci virtio_ring virtio ata_piix libata
[ 7728.485713] CPU 1
[ 7728.485713] Pid: 0, comm: swapper/1 Not tainted 3.6.0-rc6+ #4 Bochs Bochs
[ 7728.485713] RIP: 0010:[<ffffffffa006717b>]  [<ffffffffa006717b>] blk_done+0x51/0xf1 [virtio_blk]
[ 7728.485713] RSP: 0018:ffff88007fc83e98  EFLAGS: 00010093
[ 7728.485713] RAX: 0000000200000001 RBX: ffff88007ad00000 RCX: 0000000000000080
[ 7728.485713] RDX: 000000000000538a RSI: 0000000000000000 RDI: ffffea0001eddc50
[ 7728.485713] RBP: 0000000000000092 R08: ffffea0001eddc58 R09: ffff88007d002700
[ 7728.485713] R10: ffffffffa00410f9 R11: fffffffffffffff0 R12: ffff88007fc83ea4
[ 7728.485713] R13: ffff88007b771478 R14: ffff88007bf197b0 R15: 0000000000000000
[ 7728.485713] FS:  0000000000000000(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
[ 7728.485713] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 7728.485713] CR2: 0000000200000045 CR3: 00000000014e7000 CR4: 00000000000006e0
[ 7728.485713] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7728.485713] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 7728.485713] Process swapper/1 (pid: 0, threadinfo ffff88007d392000, task ffff88007d363380)
[ 7728.485713] Stack:
[ 7728.485713]  0000002924276c47 00200001810229fc 0000000100715b34 ffff88007ae12800
[ 7728.485713]  ffff88007bf19700 0000000000000000 0000000000000029 ffffffffa0041286
[ 7728.485713]  ffff880037bb8800 ffffffff8108448c 000010e324804758 0000000000000000
[ 7728.485713] Call Trace:
[ 7728.485713]  <IRQ>
[ 7728.485713]  [<ffffffffa0041286>] ? vring_interrupt+0x6f/0x76 [virtio_ring]
[ 7728.485713]  [<ffffffff8108448c>] ? handle_irq_event_percpu+0x2d/0x130
[ 7728.485713]  [<ffffffff810845bd>] ? handle_irq_event+0x2e/0x4c
[ 7728.485713]  [<ffffffff8108694f>] ? handle_edge_irq+0x98/0xb9
[ 7728.485713]  [<ffffffff81003aa7>] ? handle_irq+0x17/0x20
[ 7728.485713]  [<ffffffff810032da>] ? do_IRQ+0x45/0xad
[ 7728.485713]  [<ffffffff8137f72a>] ? common_interrupt+0x6a/0x6a
[ 7728.485713]  <EOI>
[ 7728.485713]  [<ffffffff810229fc>] ? native_safe_halt+0x2/0x3
[ 7728.485713]  [<ffffffff8100887e>] ? default_idle+0x23/0x3f
[ 7728.485713]  [<ffffffff81008b02>] ? cpu_idle+0x6b/0xaa
[ 7728.485713]  [<ffffffff81379f33>] ? start_secondary+0x1f5/0x1fa
[ 7728.485713] Code: 8b b8 b0 03 00 00 e8 55 84 31 e1 48 89 c5 eb 6e 41 8a 45 28 be fb ff ff ff 3c 02 77 0a 0f b6 c0 8b 34 85 70 81 06 a0 49 8b 45 00 <8b> 50 44 83 fa 02 74 07 83 fa 07 75 31 eb 22 41 8b 55 24 89 90
[ 7728.485713] RIP  [<ffffffffa006717b>] blk_done+0x51/0xf1 [virtio_blk]
[ 7728.485713]  RSP <ffff88007fc83e98>
[ 7728.485713] CR2: 0000000200000045
[ 7728.485713] ---[ end trace de9d8ade00a76876 ]---
[ 7728.485713] Kernel panic - not syncing: Fatal exception in interrupt

So looking at the RIP with gdb, it points into the following code that
pulls a struct virtblk_req *vbr off the virtio_ring with
virtqueue_get_buf():

(gdb) list *(blk_done+0x51)
0x19f is in blk_done (drivers/block/virtio_blk.c:82).
77		default:
78			error = -EIO;
79			break;
80		}
81
82		switch (vbr->req->cmd_type) {
83		case REQ_TYPE_BLOCK_PC:
84			vbr->req->resid_len = vbr->in_hdr.residual;
85			vbr->req->sense_len = vbr->in_hdr.sense_len;
86			vbr->req->errors = vbr->in_hdr.errors;
(gdb)

So it's starting to look pretty clear that the virtio_ring used by
virtio-blk is somehow getting messed up.. I'm now enabling DEBUG within
the virtio_ring.ko code to try to get some more details.

virtio_ring folks (Rusty + MST CC'ed), is there any other debug code
that would be helpful to track this down..?

Thanks,

--nab
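P.S. One untested idea while waiting on the virtio_ring DEBUG output: a
defensive check in blk_done() that validates the cookie returned by
virtqueue_get_buf() before dereferencing it, so a corrupted ring produces
a loud error instead of a paging fault. A hypothetical debug hack against
the v3.6 loop (a sketch, not a fix):

	/* Hypothetical debug-only check for blk_done() in
	 * drivers/block/virtio_blk.c: bail out loudly if the vbr
	 * cookie or its request pointer is not a valid kernel
	 * address, instead of faulting on vbr->req->cmd_type. */
	while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
		if (!virt_addr_valid(vbr) || !virt_addr_valid(vbr->req)) {
			pr_err("virtio_blk: bogus vbr %p from virtqueue_get_buf()\n",
			       vbr);
			break;
		}
		/* ... existing completion handling ... */
	}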