Help pinpointing cause of tgtd instability

John Pletka <jpletka@xxxxxxxxxxx> · Thu, 1 Dec 2011 10:32:35 -0500

I have two NAS devices running an almost identical workload.  One of
them has been perfectly stable for over a year now.  On the other,
tgtd either aborts or causes the iSCSI-mounted file systems to go
read-only about once a week.  I wanted to lay out my configuration
to see if there is a most likely cause.  One thing that stands out is
that the scsi-target-utils version is 1.0.4 on the unstable server and
1.0.8 on the stable server.  However, yum update on CentOS 6 reports
1.0.4 as the most recent version, and I see patches through Jan 17,
2011.  Other potential causes: the unstable server has bonded ethernet
ports and no swap partition (the OS is installed on a compact-flash
card).

From the abrt logs:
Process /usr/sbin/tgtd was killed by signal 11 (SIGSEGV)
This might be related to this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=712807

Any insight would be appreciated -- thanks in advance.

On the STABLE server:
================================
Kernel: Linux version 2.6.18-238.12.1.el5
OS: CentOS release 5.6 (Final)
Physical RAM: 4G
SWAP: 6G
scsi-target-utils-1.0.8-0.el5_6.1
iscsi-initiator-utils-6.2.0.872-6.el5
Disk array is Soft-RAID10
iSCSI volumes are sparse files on an XFS file system
Operating system is installed on an ext3 partition on the main disk array
Single ethernet port hosts main IP

On the UNSTABLE server:
===============================
Kernel: Linux version 2.6.32-71.29.1.el6.x86_64
OS: CentOS Linux release 6.0 (Final)
Physical RAM: 8G
SWAP: None
scsi-target-utils-1.0.4-3.el6_0.1.x86_64
iscsi-initiator-utils-6.2.0.872-10.el6.x86_64
Disk array is Soft-RAID6
iSCSI volumes are sparse files on an XFS file system
Operating system is installed on a compact-flash card (not part of the data disk array)
Two ethernet ports are bonded to host the main IP
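
To make the differences between the two servers easier to see at a glance, the two lists above can be compared mechanically. The following is only a small sketch: the dictionary keys are labels made up for illustration, and the values are copied from the lists above.

```python
# Values copied from the two configuration lists; the key names are
# illustrative labels, not anything reported by the systems themselves.
stable = {
    "kernel": "2.6.18-238.12.1.el5",
    "os": "CentOS release 5.6 (Final)",
    "ram": "4G",
    "swap": "6G",
    "scsi-target-utils": "1.0.8-0.el5_6.1",
    "iscsi-initiator-utils": "6.2.0.872-6.el5",
    "raid": "Soft-RAID10",
    "volume_backing": "sparse files on XFS",
    "os_disk": "ext3 partition on main disk array",
    "network": "single ethernet port",
}

unstable = {
    "kernel": "2.6.32-71.29.1.el6.x86_64",
    "os": "CentOS Linux release 6.0 (Final)",
    "ram": "8G",
    "swap": "None",
    "scsi-target-utils": "1.0.4-3.el6_0.1.x86_64",
    "iscsi-initiator-utils": "6.2.0.872-10.el6.x86_64",
    "raid": "Soft-RAID6",
    "volume_backing": "sparse files on XFS",
    "os_disk": "compact-flash card",
    "network": "two bonded ethernet ports",
}


def differing_items(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    for key, (s, u) in differing_items(stable, unstable).items():
        print(f"{key:24} stable={s!r:40} unstable={u!r}")
```

Only the volume backing (sparse files on XFS) is identical, so the listing narrows the suspects very little by itself; it mainly confirms that the tgtd version, swap, RAID level, and bonding all differ at once.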

Typical log file entry in /var/log/messages
=================================
Nov 30 11:14:51 san2 tgtd: conn_close(100) connection closed, 0x20786d8 1
Nov 30 11:14:51 san2 tgtd: conn_close(106) sesson 0x20db690 1
Nov 30 13:43:42 san2 kernel: tgtd[19686]: segfault at 0 ip 0000000000415edf sp 00007fff10009f30 error 4 in tgtd (deleted)[400000+2f000]
Nov 30 13:43:43 san2 tgtd: abort_task_set(1008) found 9 0
Nov 30 13:46:57 san2 abrt[26802]: file /usr/sbin/tgtd seems to be deleted
Nov 30 13:47:28 san2 abrt[26802]: saved core dump of pid 19686 (/usr/sbin/tgtd) to /var/spool/abrt/ccpp-1322678817-19686.new/coredump (649093120 bytes)
Nov 30 13:47:28 san2 abrtd: Directory 'ccpp-1322678817-19686' creation detected
Nov 30 13:47:28 san2 abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-1320310886-6236'
Nov 30 13:47:32 san2 abrtd: New crash /var/spool/abrt/ccpp-1322678817-19686, processing
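
For anyone reading the kernel segfault line above, here is a rough Python sketch that pulls out its fields. It assumes the usual x86 layout of that message (fault address, instruction pointer, stack pointer, page-fault error code in hex, then the mapped object with its base and size) and the standard meaning of the low three error-code bits. The function and field names are made up for illustration.

```python
import re

# The kernel line from the log above, joined onto one line.
LINE = ("Nov 30 13:43:42 san2 kernel: tgtd[19686]: segfault at 0 ip "
        "0000000000415edf sp 00007fff10009f30 error 4 in tgtd "
        "(deleted)[400000+2f000]")

PATTERN = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>.+?)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Parse an x86 kernel 'segfault at ...' line into a dict of fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": ip,
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object.
        "offset": ip - base,
        # Low three bits of the x86 page-fault error code.
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        shown = hex(value) if key in ("fault_addr", "ip", "offset") else value
        print(f"{key:10} {shown}")
```

Read this way, the line describes a user-mode read of address 0, i.e. a NULL pointer dereference inside tgtd, at offset 0x15edf into the mapping. The "(deleted)" suffix means the tgtd executable on disk was removed or replaced while the process was running, so the offset should only be resolved against a binary and debuginfo that match the build that crashed.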