On Wed, 25 Nov 2015 20:01:47 +0100, Hannes Reinecke wrote: > > On 11/25/2015 07:01 PM, Mike Snitzer wrote: > > On Wed, Nov 25 2015 at 4:04am -0500, > > Hannes Reinecke <hare@xxxxxxx> wrote: > > > >> On 11/20/2015 04:28 PM, Ewan Milne wrote: > >>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote: > >>>> Can't we have a joint effort here? > >>>> I've been spending a _LOT_ of time trying to debug things here, but > >>>> none of the ideas I've come up with have been able to fix anything. > >>> > >>> Yes. I'm not the one primarily looking at it, and we don't have a > >>> reproducer in-house. We just have the one dump right now. > >>> > >>>> > >>>> I'm almost tempted to increase the count from scsi_alloc_sgtable() > >>>> by one and be done with ... > >>>> > >>> > >>> That might not fix it if it is a problem with the merge code, though. > >>> > >> And indeed, it doesn't. > > > > How did you arrive at that? Do you have a reproducer now? > > > Not a reproducer, but several dumps for analysis. > > >> Seems I finally found the culprit. > >> > >> What happens is this: > >> We have two paths, with these seg_boundary_masks: > >> > >> path-1: seg_boundary_mask = 65535, > >> path-2: seg_boundary_mask = 4294967295, > >> > >> consequently the DM request queue has this: > >> > >> md-1: seg_boundary_mask = 65535, > >> > >> What happens now is that a request is being formatted, and sent > >> to path 2. During submission req->nr_phys_segments is formatted > >> with the limits of path 2, arriving at a count of 3. > >> Now the request gets retried on path 1, but as the NOMERGE request > >> flag is set req->nr_phys_segments is never updated. > >> But blk_rq_map_sg() ignores all counters, and just uses the > >> bi_vec directly, resulting in a count of 4 -> boom. > >> > >> So the culprit here is the NOMERGE flag, > > > > NOMERGE is always set in __blk_rq_prep_clone() for cloned requests. > > > Yes. > > >> which is evaluated via > >> ->dm_dispatch_request() > >> ->blk_insert_cloned_request() > >> ->blk_rq_check_limits() > > > > blk_insert_cloned_request() is the only caller of blk_rq_check_limits(); > > anyway after reading your mail I'm still left wondering if your proposed > > patch is correct. > > > >> If the above assessment is correct, the following patch should > >> fix it: > >> > >> diff --git a/block/blk-core.c b/block/blk-core.c > >> index 801ced7..12cccd6 100644 > >> --- a/block/blk-core.c > >> +++ b/block/blk-core.c > >> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio); > >> */ > >> int blk_rq_check_limits(struct request_queue *q, struct request *rq) > >> { > >> - if (!rq_mergeable(rq)) > >> + if (rq->cmd_type != REQ_TYPE_FS) > >> return 0; > >> > >> if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, > >> rq->cmd_flags)) { > >> > >> > >> Mike? Jens? > >> Can you comment on it? > > > > You're not explaining the actual change in the patch very well; I think > > you're correct but you're leaving the justification as an exercise to > > the reviewer: > > > > blk_rq_check_limits() will call blk_recalc_rq_segments() after the > > !rq_mergeable() check but you're saying for this case in question we > > never get there -- due to the cloned request having NOMERGE set. > > > > So in blk_rq_check_limits() you've unrolled rq_mergeable() and > > open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS) > > > > I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in > > the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no > > sense for cloned requests that always have NOMERGE set. > > > > So you're saying that by having blk_rq_check_limits() go on to call > > blk_recalc_rq_segments() this bug will be fixed? > > > That is the idea. > > I've already established that in all instances I have seen so far > req->nr_phys_segments is _less_ than req->bio->bi_phys_segments. > > As it turns out, req->nr_phys_segemnts _would_ have been updated in > blk_rq_check_limits(), but isn't due to the NOMERGE flag being set > for the cloned request. > So each cloned request inherits the values from the original request, > despite the fact that req->nr_phys_segments _has_ to be evaluated in > the final request_queue context, as the queue limits _might_ be > different from the original (merged) queue limits of the multipath > request queue. > > > BTW, I think blk_rq_check_limits()'s export should be removed and the > > function made static and renamed to blk_clone_rq_check_limits(), again: > > blk_insert_cloned_request() is the only caller of blk_rq_check_limits() > > > Actually, seeing Jens' last comment the check for REQ_TYPE_FS is > pointless, too, so we might as well remove the entire if-clause. > > > Seems prudent to make that change now to be clear that this code is only > > used by cloned requests. > > > Yeah, that would make sense. I'll be preparing a patch. > With a more detailed description :-) Do we have already a fix? Right now I got (likely) this kernel BUG() on the almost latest Linus tree (commit 25364a9e54fb8296). It happened while I started a KVM right after a fresh boot. The machine paniced even before that, so I hit this twice today. I have a crash dump, so if anything is needed, let me know. (But the system had to be restarted since it's my main workstation.) thanks, Takashi === [ 137.211255] ------------[ cut here ]------------ [ 137.211265] WARNING: CPU: 0 PID: 391 at block/blk-merge.c:435 blk_rq_map_sg+0x430/0x4d0() [ 137.211267] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore [ 137.211326] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg [ 137.211342] CPU: 0 PID: 391 Comm: kworker/0:1H Not tainted 4.4.0-rc3-test2+ #183 [ 137.211344] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 [ 137.211349] Workqueue: kblockd bfq_kick_queue [ 137.211352] ffffffff81a6b805 ffff880214307ab8 ffffffff813b1cc9 0000000000000000 [ 137.211357] ffff880214307af0 ffffffff8106d9a2 ffff88019e7d49c8 0000000000000000 [ 137.211362] 0000000000000000 0000000000001000 0000000000000020 ffff880214307b00 [ 137.211367] Call Trace: [ 137.211373] [<ffffffff813b1cc9>] dump_stack+0x4b/0x72 [ 137.211378] [<ffffffff8106d9a2>] warn_slowpath_common+0x82/0xc0 [ 137.211383] [<ffffffff8106da9a>] warn_slowpath_null+0x1a/0x20 [ 137.211387] [<ffffffff81381340>] blk_rq_map_sg+0x430/0x4d0 [ 137.211393] [<ffffffff81514a60>] ? scsi_init_cmd_errh+0x90/0x90 [ 137.211398] [<ffffffff81514fc4>] scsi_init_sgtable+0x44/0x80 [ 137.211402] [<ffffffff81515045>] scsi_init_io+0x45/0x1d0 [ 137.211408] [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20 [ 137.211413] [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10 [ 137.211416] [<ffffffff81524fc7>] sd_init_command+0x27/0xa0 [ 137.211420] [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150 [ 137.211424] [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170 [ 137.211428] [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320 [ 137.211433] [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610 [ 137.211438] [<ffffffff81374943>] __blk_run_queue+0x33/0x40 [ 137.211442] [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40 [ 137.211447] [<ffffffff8108b18d>] process_one_work+0x1ed/0x660 [ 137.211451] [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660 [ 137.211456] [<ffffffff8108b64e>] worker_thread+0x4e/0x450 [ 137.211460] [<ffffffff8108b600>] ? process_one_work+0x660/0x660 [ 137.211464] [<ffffffff8108b600>] ? process_one_work+0x660/0x660 [ 137.211468] [<ffffffff8109206e>] kthread+0xfe/0x120 [ 137.211473] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230 [ 137.211478] [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70 [ 137.211482] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230 [ 137.211485] ---[ end trace b39a62f95b1cad19 ]--- [ 137.211499] ------------[ cut here ]------------ [ 137.211506] kernel BUG at drivers/scsi/scsi_lib.c:1096! [ 137.211511] invalid opcode: 0000 [#1] PREEMPT SMP [ 137.211519] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore [ 137.211630] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg [ 137.211660] CPU: 0 PID: 391 Comm: kworker/0:1H Tainted: G W 4.4.0-rc3-test2+ #183 [ 137.211668] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 [ 137.211675] Workqueue: kblockd bfq_kick_queue [ 137.211681] task: ffff880214754a40 ti: ffff880214304000 task.ti: ffff880214304000 [ 137.211713] RIP: 0010:[<ffffffff81514ff2>] [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80 [ 137.211747] RSP: 0018:ffff880214307b90 EFLAGS: 00010097 [ 137.211777] RAX: 000000000000000e RBX: ffff8802147e68b0 RCX: 0000000000000006 [ 137.211807] RDX: 0000000000000024 RSI: 0000000000000000 RDI: 0000000000000009 [ 137.211838] RBP: ffff880214307ba8 R08: 0000000000000001 R09: 0000000000000000 [ 137.211868] R10: ffff8802147e68b0 R11: 0000000000000001 R12: ffff88003711c320 [ 137.211899] R13: 0000000000000000 R14: ffff8800370be000 R15: ffff8802147e68b0 [ 137.211930] FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000 [ 137.211961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 137.211991] CR2: 000000000041bee0 CR3: 00000000d85ac000 CR4: 00000000001426f0 [ 137.212021] Stack: [ 137.212048] ffff88003711c200 000000000003b000 0000000000000000 ffff880214307bf0 [ 137.212082] ffffffff81515045 ffff880214307c28 0000000000000092 ffff8802147e68b0 [ 137.212115] 000000000003b000 ffff8800d8a90800 ffff8800370be000 0000000000000020 [ 137.212148] Call Trace: [ 137.212178] [<ffffffff81515045>] scsi_init_io+0x45/0x1d0 [ 137.212205] [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20 [ 137.212232] [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10 [ 137.212258] [<ffffffff81524fc7>] sd_init_command+0x27/0xa0 [ 137.212285] [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150 [ 137.212312] [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170 [ 137.212338] [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320 [ 137.212365] [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610 [ 137.212392] [<ffffffff81374943>] __blk_run_queue+0x33/0x40 [ 137.212418] [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40 [ 137.212444] [<ffffffff8108b18d>] process_one_work+0x1ed/0x660 [ 137.212471] [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660 [ 137.212498] [<ffffffff8108b64e>] worker_thread+0x4e/0x450 [ 137.212525] [<ffffffff8108b600>] ? process_one_work+0x660/0x660 [ 137.212552] [<ffffffff8108b600>] ? process_one_work+0x660/0x660 [ 137.212578] [<ffffffff8109206e>] kthread+0xfe/0x120 [ 137.212605] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230 [ 137.212631] [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70 [ 137.212657] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230 [ 137.212683] Code: 24 08 72 27 41 89 44 24 08 8b 43 5c 5b 41 89 44 24 10 44 89 e8 41 5c 41 5d 5d c3 41 bd 02 00 00 00 5b 44 89 e8 41 5c 41 5d 5d c3 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 [ 137.212798] RIP [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80 [ 137.212826] RSP <ffff880214307b90> -- To unsubscribe from this list: send the line "unsubscribe linux-block" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html