Re: tgtd segfault with software RAID, hard resetting link

Tomasz Chmielewski <mangoo@xxxxxxxx> · Tue, 21 Apr 2009 14:06:27 +0200

FUJITA Tomonori wrote:

>> So I rebooted the initiator without logging it out of the target, with
>> "echo b >/proc/sysrq-trigger" (it's a diskless initiator, so basically
>> that's the only method when its disks are gone).
>>
>> Initiator started to boot again and I think tgtd segfaulted when the
>> initiator tried to log in to the target.
>
> Duh, seems that we have another problem.
>
> Can you reproduce this by just rebooting the initiator without logging
> out and starting the initiator again?

It seems to be harder to cause it on purpose... but yes, it's reproducible.
It may or may not be a different problem.

1) On initiator, do:

# echo 3 > /sys/block/sda/device/timeout
# echo 3 > /sys/block/sdd/device/timeout 
# dd if=/dev/zero of=/mnt/iscsi/bigfile bs=64k

2) On the target, do (drive being a part of RAID):

# i=1; while [ $i -ne 100 ] ; do echo $i; hdparm -Y /dev/sdd;  i=$((i+1)); done

3) If IO errors appear on the initiator (this seems important), reboot it without logging out of the target:

# echo b >/proc/sysrq-trigger

4) Initiator will start booting and will connect to the target.
It won't be able to boot (hdparm loop still running on the target; some data still in cache/dirty/writeback).

Interrupt the loop; with luck, tgtd _may_ segfault.
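
The target-side loop from step 2 could also be wrapped so that it stops as soon as tgtd is gone and shows which iteration it got to. This is only a minimal sketch: loop_until_gone is a made-up helper name, and checking for the daemon with "pidof tgtd" is an assumption about the setup, not something I ran during the tests above.

```shell
#!/bin/sh
# Hypothetical helper (not part of tgt): run ACTION up to MAX times and
# stop early as soon as CHECK fails.
# Usage: loop_until_gone MAX CHECK ACTION
loop_until_gone() {
    max=$1
    check=$2
    action=$3
    i=1
    while [ "$i" -le "$max" ]; do
        # Stop if the check command fails (e.g. the daemon has died).
        if ! sh -c "$check" >/dev/null 2>&1; then
            echo "check failed before iteration $i"
            return 1
        fi
        echo "$i"
        # Run the action; its own output and exit status are ignored here.
        sh -c "$action" >/dev/null 2>&1
        i=$((i + 1))
    done
    return 0
}

# Example, as root on the target (99 iterations, same as the loop in step 2;
# assumes pidof is available and /dev/sdd is the RAID member):
# loop_until_gone 99 'pidof tgtd' 'hdparm -Y /dev/sdd'
```

If the function returns non-zero, the last number printed is the iteration after which tgtd disappeared.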

--------------------

While trying to reproduce it, I ran the following on the initiator (both are iSCSI disks):

# echo 3 > /sys/block/sda/device/timeout
# echo 3 > /sys/block/sdd/device/timeout 

Then, on the target:

# i=1; while [ $i -ne 100 ] ; do echo $i; hdparm -Y /dev/sdd;  i=$((i+1)); done

And it segfaulted after ~30 iterations (happened only once; no initiator reboot needed):

Apr 21 13:26:07 megathecus tgtd: conn_close(100) connection closed, 0x81a791c 1
Apr 21 13:27:12 megathecus tgtd: abort_task_set(988) found 51 0
Apr 21 13:27:12 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:12 megathecus tgtd: abort_cmd(964) found 21 e
Apr 21 13:27:20 megathecus tgtd: abort_task_set(988) found 39 0
Apr 21 13:27:20 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:20 megathecus tgtd: abort_cmd(964) found 73 e
Apr 21 13:27:41 megathecus tgtd: conn_close(100) connection closed, 0x81a791c 3
Apr 21 13:27:41 megathecus tgtd: conn_close(106) sesson 0x81a7d70 1
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 10000051 0
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 10000041 0
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 10000050 0
Apr 21 13:27:47 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:47 megathecus tgtd: abort_cmd(964) found 10000043 e
Apr 21 13:27:47 megathecus tgtd: abort_cmd(964) found 10000040 e
Apr 21 13:27:47 megathecus tgtd: abort_cmd(964) found 10000045 e
Apr 21 13:27:58 megathecus tgtd: abort_task_set(988) found 10000072 0
Apr 21 13:27:58 megathecus tgtd: abort_task_set(988) found 0 0
Apr 21 13:27:58 megathecus tgtd: abort_cmd(964) found 1000007e e
Apr 21 13:28:07 megathecus tgtd: conn_close(100) connection closed, 0x81a700c 4
Apr 21 13:28:07 megathecus tgtd: conn_close(106) sesson 0x81a71f0 1
Apr 21 13:28:09 megathecus kernel: tgtd[21360]: segfault at 0 ip 080546d6 sp 6cc1c340 error 4 in tgtd[8048000+24000]
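
For what it's worth, the kernel line above can be picked apart with a few lines of shell. On x86, page-fault error code 4 means a user-mode read of a page that is not present, and the fault address is 0, so this looks like a NULL pointer read inside tgtd. The sketch below pulls out the fields and computes the offset of the faulting instruction inside the mapping; the function names are invented for illustration.

```shell
#!/bin/sh
# Extract one numeric field ("at", "ip", "sp" or "error") from a
# kernel segfault line.  $1 = field name, $2 = log line.
segv_field() {
    printf '%s\n' "$2" | sed -n "s/.* $1 \([0-9a-f]*\).*/\1/p"
}

# Offset of the faulting instruction inside the mapped binary:
# ip minus the base address from the trailing "name[base+size]".
segv_offset() {
    ip=$(segv_field ip "$1")
    base=$(printf '%s\n' "$1" | sed -n 's/.*\[\([0-9a-f]*\)+[0-9a-f]*].*/\1/p')
    printf '0x%x\n' $(( 0x$ip - 0x$base ))
}

# Decode the low three bits of the x86 page-fault error code.
decode_error() {
    code=$1
    if [ $((code & 4)) -ne 0 ]; then mode="user-mode"; else mode="kernel-mode"; fi
    if [ $((code & 2)) -ne 0 ]; then rw="write"; else rw="read"; fi
    if [ $((code & 1)) -ne 0 ]; then why="protection violation"; else why="page not present"; fi
    echo "$mode $rw, $why"
}

# Example with the line from the log above:
# line='tgtd[21360]: segfault at 0 ip 080546d6 sp 6cc1c340 error 4 in tgtd[8048000+24000]'
# segv_field ip "$line"; segv_offset "$line"; decode_error "$(segv_field error "$line")"
```

For this crash the offset works out to 0xc6d6. The base 8048000 looks like the traditional fixed load address of a non-PIE i386 binary, so the ip can probably be given to addr2line directly, e.g. "addr2line -f -e /usr/sbin/tgtd 0x080546d6". The path is an assumption, and the binary has to be unstripped (or built with debug info) for that to return anything useful.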

--
Tomasz Chmielewski
http://wpkg.org
