Re: question on block-layer timeout change

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Mike Anderson wrote:
Shi, Harris <Harris.Shi@xxxxxxx> wrote:
Information from /var/log/messages:
===================================
Dec 17 15:58:14 timon kernel: sd 6:0:0:2: [sdd] Sense Key : Recovered Error [current]
Dec 17 15:58:14 timon kernel: sd 6:0:0:2: [sdd] <<vendor>> ASC=0x95 ASCQ=0x1ASC=0x95 ASCQ=0x1
Dec 17 15:58:25 timon kernel:  connection2:0: ping timeout of 15 secs expired, last rx 19237, last ping 20487, now 24237
Dec 17 15:58:25 timon kernel:  connection2:0: detected conn error (1011)
Dec 17 15:58:26 timon iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3)



Information from Serial output:
===============================
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index1/shared_cpu_map
Modules linked in: radeon drm agpgart crc32c libcrc32c ib_iser rdma_cm ib_cm nfs iw_cm lockd ib_sa ib_mad nfs_acl ib_core i6
IP: [<c011a274>] __ticket_spin_lock+0x8/0x19
*pdpt = 00000000319fe001 *pde = 0000000000000000
BUG: unable to handle kernel NULL pointer dereference at 00000086
IP: [<c011a274>] __ticket_spin_lock+0x8/0x19
*pdpt = 0000000000546001 *pde = 0000000000000000
 ipv6 af_packet microcode fuse loop dm_mod mptctl e1000 iTCO_wdt sr_mod video iTCO_vendor_support e752x_edac output shpchp ]

Pid: 0, comm: swapper Not tainted (2.6.28-rc8-test-1-pae #1) PowerEdge 2850
EIP: 0060:[<c011a274>] EFLAGS: 00010086 CPU: 3
EIP is at __ticket_spin_lock+0x8/0x19
EAX: 00000086 EBX: f10f6380 ECX: f20b5400 EDX: 00000100
ESI: f18223b0 EDI: 00000000 EBP: f38a5e78 ESP: f38a5e78
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=f38a4000 task=f38a2fd0 task.ti=f38a4000)
Stack:
 f38a5e80 c0328e0f f38a5e98 f9298389 00000002 f10f6380 f18223b0 00000000
 f38a5ea4 f7e13396 f11b9300 f38a5eb0 c0212539 f11b9300 f38a5ed4 c02125f2
 f18225b8 00000282 f389c000 f18224f4 00000100 f389c000 c0212573 f38a5f08
Call Trace:
 [<c0328e0f>] ? _spin_lock+0x15/0x18
 [<f9298389>] ? iscsi_eh_cmd_timed_out+0x24/0xb0 [libiscsi]
 [<f7e13396>] ? scsi_times_out+0x35/0x61 [scsi_mod]
 [<c0212539>] ? blk_rq_timed_out+0xc/0x46

I could not match my listing exactly with this output, but it appears that
the session is NULL when we call into iscsi_eh_cmd_timed_out. An addr2line
would help verify the iscsi_eh_cmd_timed_out line.

I added Mike C to the email cc for possible comments on the error messages
displayed above and if that would lead to cleanup of structures referenced
in iscsi_eh_cmd_timed_out.


Sorry for the late reply. I have been on vacation.

The iscsi error message just indicates that the initiator tried to send a iscsi ping and it did not get a response, so the initiator dropped the session. The error was reported as a generic connection error (1011) and when the error was fired the initiator was in the logged in / full feature phase (this basically means normal old use and nothing special).

Due to the sles use and MPP driver, I am not sure what exactly is running, but for this code path the iscsi driver does this:

        cls_session = starget_to_session(scsi_target(scmd->device));
        session = cls_session->dd_data;

to get the session in iscsi_eh_cmd_timed_out (we do this in all kernels do that has not changed).

The session pointers are only changed when a session is destroyed, and that only happens if you do a logout of the session (iscsiadm -m .... -u), and at that time when the session is destroyed we should have flushed all IO.

Are you guys doing a logout of the session with iscsiadm at this time?

Does MPP clone commands and is it doing something with the command's pointers to the device?

Another possibilty is that commands are not getting cleaned up correctly. When you see this "connection2:0: detected conn error (1011)", the driver is going to kill all outstadnding commands and and call scsi_done on them to requeue them with the scsi layer, so we should be getting any commands timed out after you see that message (maybe only in some race case where the session's commands are getting flushed at the exact same time the scsi eh was firing). There was no time stamp on the oops output but I doubt this happened. Did the oops happen after the conn error message though?
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]
  Powered by Linux