Re: BUG: unable to handle kernel NULL pointer dereference in transport_allocate_data_tasks+0x1af/0x2cc [target_core_mod]

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Mon, 08 Aug 2011 18:37:44 -0700

On Mon, 2011-08-08 at 22:57 +0100, Chris Boot wrote:
> On 08/08/11 21:42, Nicholas A. Bellinger wrote:
> > On Sun, 2011-08-07 at 19:16 +0100, Chris Boot wrote:
> >> Hi all,
> >>
> >> I've been using LIO for a little while, and thought I'd give what's
> >> currently in linux-2.6.git (master) a shot. Unfortunately, I get a BUG
> >> within seconds of an initiator connecting to one of my targets - this
> >> isn't immediate but happens a couple of seconds in. The kernel log is below:
> >>
> > Greetings again Chris,
> >
> > Thank you for your detailed bug reported with the v4.1.0-rc1 code.  My
> > comments are below.
> >
> >> [  109.599886] BUG: unable to handle kernel NULL pointer dereference at
> >> 000000000000000c
> >> [  109.617486] IP: [<ffffffffa048ca5a>]
> >> transport_allocate_data_tasks+0x1af/0x2cc [target_core_mod]
> >> [  109.636400] PGD 0
> >> [  109.647538] Oops: 0000 [#1] SMP
> >> [  109.660125] CPU 1
> >> [  109.662253] Modules linked in: iscsi_target_mod target_core_pscsi
> >> target_core_file target_core_iblock target_core_mod configfs
> >> ip6table_mangle iptable_mangle xt_DSCP xt_owner tun ip6table_filter
> >> ip6_tables iptable_filter ip_tables cpufreq_stats cpufreq_userspace
> >> cpufreq_conservative ebtable_nat ebtables x_tables cpufreq_powersave
> >> ipmi_watchdog microcode nfsd nfs lockd fscache auth_rpcgss nfs_acl
> >> sunrpc bridge stp ext4 jbd2 crc16 ipmi_si ipmi_devintf ipmi_msghandler
> >> acpi_cpufreq mperf coretemp fuse crc32c_intel aesni_intel cryptd
> >> aes_x86_64 aes_generic dm_crypt kvm_intel kvm snd_pcm snd_timer snd
> >> soundcore snd_page_alloc processor evdev i2c_i801 ftdi_sio i2c_core
> >> joydev usbserial video pcspkr button ext3 jbd mbcache dm_mod sg sd_mod
> >> crc_t10dif usbhid hid ahci libahci 3w_9xxx libata ehci_hcd thermal fan
> >> scsi_mod thermal_sys usbcore e1000e [last unloaded: scsi_wait_scan]
> >> [  109.674461]
> >> [  109.674464] Pid: 4703, comm: iscsi_trx Not tainted 3.0.0+ #1 Intel
> >> Corporation S1200BTL/S1200BTL
> >> [  109.674468] RIP: 0010:[<ffffffffa048ca5a>]  [<ffffffffa048ca5a>]
> >> transport_allocate_data_tasks+0x1af/0x2cc [target_core_mod]
> >> [  109.674478] RSP: 0018:ffff880187ccdcd0  EFLAGS: 00010283
> >> [  109.674480] RAX: 0000000000000000 RBX: ffff8801877dd280 RCX:
> >> 0000000000000000
> >> [  109.674483] RDX: 0000000000000000 RSI: 0000000000000020 RDI:
> >> ffff88022dbd23e0
> >> [  109.674485] RBP: ffff880230327c00 R08: 0000000000000000 R09:
> >> 0000000000000040
> >> [  109.674487] R10: ffff88018769cc00 R11: ffff8801877dd040 R12:
> >> 0000000000000000
> >> [  109.674490] R13: 0000000000000002 R14: ffff880189884400 R15:
> >> 0000000000000100
> >> [  109.674493] FS:  0000000000000000(0000) GS:ffff880236820000(0000)
> >> knlGS:0000000000000000
> >> [  109.674496] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> [  109.674498] CR2: 000000000000000c CR3: 0000000001604000 CR4:
> >> 00000000000426e0
> >> [  109.674500] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >> 0000000000000000
> >> [  109.674503] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >> 0000000000000400
> >> [  109.674506] Process iscsi_trx (pid: 4703, threadinfo
> >> ffff880187ccc000, task ffff8801edb571e0)
> >> [  109.674508] Stack:
> >> [  109.674509]  0000000000000000 ffff880200000000 ffff88022dbd23e0
> >> 0000000000000100
> >> [  109.674513]  0000000100000200 000000000062ddf0 0000004000000020
> >> ffff8801877dd40c
> >> [  109.674516]  0000004100000002 ffff8801877dd4a8 0000000000000000
> >> ffff8801877dd280
> >> [  109.674520] Call Trace:
> >> [  109.674529]  [<ffffffffa048f7d8>] ?
> >> transport_generic_new_cmd+0x190/0x2f6 [target_core_mod]
> >> [  109.674537]  [<ffffffffa04c413e>] ? iscsit_allocate_iovecs+0x24/0x4d
> >> [iscsi_target_mod]
> >> [  109.674545]  [<ffffffffa048f9a5>] ?
> >> transport_handle_cdb_direct+0x67/0x8e [target_core_mod]
> >> [  109.674552]  [<ffffffffa04c253b>] ? iscsit_sequence_cmd+0x5f/0xd8
> >> [iscsi_target_mod]
> >> [  109.674558]  [<ffffffffa04c5b01>] ?
> >> iscsi_target_rx_thread+0x7d2/0x1a71 [iscsi_target_mod]
> >> [  109.674564]  [<ffffffff8100d027>] ? load_TLS+0x7/0xa
> >> [  109.674567]  [<ffffffff8100d60f>] ? __switch_to+0xc1/0x20e
> >> [  109.674574]  [<ffffffffa04c532f>] ?
> >> iscsit_thread_get_cpumask+0x88/0x88 [iscsi_target_mod]
> >> [  109.674578]  [<ffffffff8105e48d>] ? kthread+0x76/0x7e
> >> [  109.674582]  [<ffffffff81334a14>] ? kernel_thread_helper+0x4/0x10
> >> [  109.674585]  [<ffffffff8105e417>] ? kthread_worker_fn+0x139/0x139
> >> [  109.674588]  [<ffffffff81334a10>] ? gs_change+0x13/0x13
> >> [  109.674590] Code: 10 03 00 00 ff 90 c0 00 00 00 e9 1c 01 00 00 8b 75
> >> 10 48 89 c7 e8 97 7f d1 e0 8b 55 40 4c 8b 45 08 c7 44 24 30 00 00 00 00
> >> eb 45
> >> [  109.674603]  39 54 24 0c 77 47 4c 89 c7 4c 89 e6 b9 08 00 00 00 f3 a5 4c
> >> [  109.674610] RIP  [<ffffffffa048ca5a>]
> >> transport_allocate_data_tasks+0x1af/0x2cc [target_core_mod]
> >> [  109.674618]  RSP<ffff880187ccdcd0>
> >> [  109.674620] CR2: 000000000000000c
> >> [  109.674662] ---[ end trace 263ff2d45aa35d71 ]---
> >> [  203.639102] iSCSI Login timeout on Network Portal
> >> 2001:8b0:49:1:21e:67ff:fe14:69f4:3260
> >>
> >> The crash happens here:
> >>
> >> (gdb) list *transport_allocate_data_tasks+0x1af
> >> 0x11a5a is in transport_allocate_data_tasks
> >> (drivers/target/target_core_transport.c:4142).
> >> 4137
> >> 4138                    task_size = task->task_size;
> >> 4139
> >> 4140                    /* Build new sgl, only up to task_size */
> >> 4141                    for_each_sg(task->task_sg, sg,
> >> task->task_sg_nents, count) {
> >> 4142                            if (cmd_sg->length>  task_size)
> >> 4143                                    break;
> >> 4144
> >> 4145                            *sg = *cmd_sg;
> >> 4146                            task_size -= cmd_sg->length;
> >>
> >> Unfortunately I'm no kernel expert, so I can't debug any further than this.
> > It looks like we are running into some breakage with the recent v4.1 I/O
> > task generation code changes and a certain SCSI CDB size combination
> > being handled by iscsi-target (grover CC'ed).
> >
> > As this new code is currently lacking debug output, the easiet way to
> > debug this with your current build would be making a wireshark packet
> > capture from the initiator side so we can determine which CDB is
> > actually trigger the bug here..
> >
> > Proper pr_debug() out messages need to be added for these codepaths to
> > handle this types of cases w/o a recompile, but that is going to take a
> > little more time.  ;)
> >
> > Also, the output from the following quick and dirty debug patch would be
> > helpful to pinpoint the issue as well.
> >
> > Thanks for your bug report Chris!
> >
> > --nab
> [snip]
> 
> Nicholas,
> 
> First of all, I should mention I tried to replicate the BUG on a 
> different KVM virtual machine, with the same kernel and accessing the 
> same storage with a virtio-block device. I couldn't replicate the bug on 
> the VM, though it's very easy to trigger on my iSCSI box and KVM host.
> 
> With your patch applied, I get different behaviour but no BUG. Without 
> further ado, here are the messages logged on the server:
> 
> [  267.452296] bio: create slab <bio-1> at 1
> [  278.361145] cmd->data_length: 4096, sectors: 8, task_count: 1
> [  278.377229] task[0]: task_lba: 0 task_sectors: 8 task_size: 4096
> [  278.393720] Setup task->task_sg: ffff8802253f88e0 task->task_sg_nents: 1
> [  278.410983] sg[0] Remaining task_size: 4096 cmd_sg: ffff8802253f88c0 
> cmd_sg->length: 4096
> [  278.429978] Bogus SGL at           (null)
> [  339.991562] cmd->data_length: 4096, sectors: 8, task_count: 1
> [  340.007700] task[0]: task_lba: 0 task_sectors: 8 task_size: 4096
> [  340.024201] Setup task->task_sg: ffff8802253f8ce0 task->task_sg_nents: 1
> [  340.041731] sg[0] Remaining task_size: 4096 cmd_sg: ffff8802253f8c80 
> cmd_sg->length: 4096
> [  340.061113] Bogus SGL at           (null)
> [  401.976353] cmd->data_length: 4096, sectors: 8, task_count: 1
> [  401.993067] task[0]: task_lba: 0 task_sectors: 8 task_size: 4096
> [  402.010057] Setup task->task_sg: ffff88021f448760 task->task_sg_nents: 1
> [  402.027834] sg[0] Remaining task_size: 4096 cmd_sg: ffff8802253f8b40 
> cmd_sg->length: 4096
> [  402.047406] Bogus SGL at           (null)
> 
> Note that this repeats at regular intervals. The target logs in and the 
> error triggers as soon as it tries to read the partition table. The 
> initiator (in this case Open-iSCSI rather than Windows which I have been 
> testing with before) just hangs, I'm assuming it's constantly re-trying 
> the read.
> 

Hmmm, I think the quick debug patch was slightly off here, as we can
expect a NULL sg_next() in the normal case.  Could you try moving the
following debug check that your logs are showing above from the bottom
of the loop:

                        cmd_sg = sg_next(cmd_sg);
+                       if (!cmd_sg || !cmd_sg->length) {
+                               printk("Bogus SGL at %p\n", cmd_sg);
+                               return -EINVAL;
+                       }

to the top of the loop right before the printk() like so:

                /* Build new sgl, only up to task_size */
                for_each_sg(task->task_sg, sg, task->task_sg_nents, count) {
+			if (!cmd_sg || !cmd_sg->length) {
+				printk("Bogus SGL at %p\n", cmd_sg);
+				return -EINVAL;
+			}
+                       printk("sg[%d] Remaining task_size: %d cmd_sg: %p cmd_sg->length: %d\n",
+                               count, task_size, cmd_sg, cmd_sg->length);
+

and try retesting with this..?

Also, having that wireshark capture would be extremely helpful.  ;)

Thanks,

--nab

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html