Re: tgtd segfault during heavy I/O

Kiefer Chang <zapchang@xxxxxxxxx> · Tue, 5 Jul 2011 23:28:08 +0800

I tried to reproduce the problem today and hit the following crash
point several times.

[New Thread 0x2aabcae77940 (LWP 25576)]
[New Thread 0x2aabcb878940 (LWP 25606)]
[New Thread 0x2aabcc279940 (LWP 25610)]
[New Thread 0x2aabccc7a940 (LWP 25611)]
[New Thread 0x2aabcd67b940 (LWP 25612)]
[New Thread 0x2aabce07c940 (LWP 25983)]
[New Thread 0x2aabcea7d940 (LWP 25989)]
[New Thread 0x2aabcf47e940 (LWP 25990)]
[New Thread 0x2aabcfe7f940 (LWP 25991)]
[New Thread 0x2aabd0880940 (LWP 26017)]
[New Thread 0x2aabd1281940 (LWP 26018)]
[New Thread 0x2aabd1c82940 (LWP 26019)]
[New Thread 0x2aabd2683940 (LWP 26020)]
[New Thread 0x2aabd3084940 (LWP 26097)]
[New Thread 0x2aabd3a85940 (LWP 26112)]
[New Thread 0x2aabd4486940 (LWP 26113)]
[New Thread 0x2aabd4e87940 (LWP 26114)]
[New Thread 0x2aabd5888940 (LWP 26135)]
[New Thread 0x2aabd6289940 (LWP 26136)]
[New Thread 0x2aabd6c8a940 (LWP 26137)]
[New Thread 0x2aabd768b940 (LWP 26138)]

Program received signal SIGSEGV, Segmentation fault.
0x000000000041c5b7 in abort_task_set (mreq=0x115eef00,
target=0x103be510, itn_id=2478, tag=805306479, lun=0x0,
    all=0) at target.c:1155
1155                            list_for_each_entry_safe(cmd, tmp,
list, c_hlist) {
(gdb) bt
#0  0x000000000041c5b7 in abort_task_set (mreq=0x115eef00,
target=0x103be510, itn_id=2478, tag=805306479,
    lun=0x0, all=0) at target.c:1155
#1  0x000000000041c7ee in target_mgmt_request (tid=21440, itn_id=2478,
req_id=277390720, function=13,
    lun_buf=0x1088a588 "", tag=805306479, host_no=0) at target.c:1202
#2  0x00000000004085be in iscsi_tm_execute (task=0x1088a580) at
iscsi/iscsid.c:1431
#3  0x0000000000408755 in iscsi_task_execute (task=0x1088a580) at
iscsi/iscsid.c:1480
#4  0x0000000000408b04 in iscsi_task_queue (task=0x1088a580) at
iscsi/iscsid.c:1557
#5  0x000000000040927b in iscsi_task_rx_done (conn=0x11910c88) at
iscsi/iscsid.c:1698
#6  0x000000000040a323 in iscsi_rx_handler (conn=0x11910c88) at
iscsi/iscsid.c:2114
#7  0x0000000000411ba6 in iscsi_tcp_event_handler (fd=428, events=1,
data=0x11910c88) at iscsi/iscsi_tcp.c:158
#8  0x0000000000417365 in event_loop () at tgtd.c:454
#9  0x0000000000417a16 in main (argc=1, argv=0x7fffa114a938) at tgtd.c:640
(gdb)

(gdb) p i
$1 = 6
(gdb) print  ARRAY_SIZE(itn->cmd_hash_list)
No symbol "ARRAY_SIZE" in current context.
(gdb) p cmd
$2 = (struct scsi_cmd *) 0x5287f1ab8a18d390
(gdb) p list
$3 = (struct list_head *) 0x10580ba0
(gdb) p tmp
$4 = (struct scsi_cmd *) 0x5287f1ab8a18d390
(gdb) p cmd->dev
Cannot access memory at address 0x5287f1ab8a18d3c0
(gdb) p list
$5 = (struct list_head *) 0x10580ba0
(gdb) p itn
$6 = (struct it_nexus *) 0x10580b30
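
A note on the loop at target.c:1155. In the session above, cmd and tmp hold
the same value, 0x5287f1ab8a18d390, which is not a canonical x86-64 address,
so it looks like overwritten data rather than a real pointer. The "_safe"
variant of the iterator only tolerates removal of the current entry; it
still reads the next pointer stored in that entry before the loop body
runs. The sketch below shows the behaviour the macro does guarantee. It is
a simplified re-implementation of Linux-style list macros, not a copy of
tgt's list.h, and struct demo_cmd, drain() and demo() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified Linux-style intrusive list (illustration only). */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* n always holds the entry after pos, fetched before the body runs. */
#define list_for_each_entry_safe(pos, n, head, member)                    \
	for (pos = list_entry((head)->next, __typeof__(*pos), member),     \
	     n = list_entry(pos->member.next, __typeof__(*pos), member);   \
	     &pos->member != (head);                                       \
	     pos = n, n = list_entry(n->member.next, __typeof__(*n), member))

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Hypothetical stand-in for struct scsi_cmd. */
struct demo_cmd {
	unsigned long tag;
	struct list_head c_hlist;
};

/* Unlink and free every entry; returns the number of entries freed. */
static int drain(struct list_head *list)
{
	struct demo_cmd *cmd, *tmp;
	int freed = 0;

	list_for_each_entry_safe(cmd, tmp, list, c_hlist) {
		list_del(&cmd->c_hlist); /* safe: tmp was saved beforehand */
		free(cmd);
		freed++;
	}
	return freed;
}

/* Build a three-entry list, then drain it. */
static int demo(void)
{
	struct list_head list;
	unsigned long t;

	list_init(&list);
	for (t = 0; t < 3; t++) {
		struct demo_cmd *c = malloc(sizeof(*c));
		assert(c != NULL);
		c->tag = t;
		list_add_tail(&c->c_hlist, &list);
	}
	return drain(&list);
}
```

If something called from the loop body frees or unlinks the entry that tmp
points to, or if an entry was freed earlier without being unlinked, the
next step of the loop reads a stale next pointer. That would be consistent
with the values printed above, though the trace alone does not prove it.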

==
The system log can be downloaded from the following URL:
http://dl.dropbox.com/u/8354750/tgtd/20110705/reproduce_02/messages.zip

Thanks a lot.
Kiefer Chang

2011/7/4 Kiefer Chang <zapchang@xxxxxxxxx>:
> Dear Tomonori,
>
> We are getting segfaults under heavy I/O. I hope you can give some suggestions.
>
> [Setting]
> 7 machines; each machine runs a VM, and each VM uses 10 targets on
> tgtd. Each machine has a 1GbE card.
> So there are at least 70 volumes on tgtd.
>
> The tgtd (1.0.16) is running on a machine with two bonded 10GbE cards.
> LVM logical volumes are used as the backing stores of the targets.
> (The physical volume is on software RAID 5.)
>
> Both the initiator side and the target side are running CentOS 5.4.
>
> I tried to set up the system so that a core dump is generated when the
> problem hits. The core dump file seems incomplete: its size is over 8 GB,
> but it only uses about 30-50 MB of disk space.
>
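
A note on the core file size: a large apparent size with small disk usage
does not by itself mean the dump is truncated. Linux typically writes core
files sparsely, skipping pages that were never touched, and a process with
many threads reserves a lot of untouched stack space. A minimal sketch for
comparing the two sizes follows. file_usage() and print_usage() are
hypothetical helpers, and the 512-byte unit for st_blocks is the Linux
convention.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Report the apparent size and the allocated size of a file.
 * Returns 0 on success, -1 if stat() fails.
 */
static int file_usage(const char *path, long long *apparent,
		      long long *on_disk)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	*apparent = (long long)st.st_size;
	/* On Linux, st_blocks counts 512-byte units. */
	*on_disk = (long long)st.st_blocks * 512;
	return 0;
}

/* Print both sizes; a sparse file shows on_disk well below apparent. */
static void print_usage(const char *path)
{
	long long apparent, on_disk;

	if (file_usage(path, &apparent, &on_disk) != 0) {
		perror(path);
		return;
	}
	printf("%s: apparent=%lld bytes, on disk=%lld bytes\n",
	       path, apparent, on_disk);
}
```

Whether the dump is actually usable is better judged by loading it in gdb
together with the matching tgtd binary.
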
> So I attached gdb to a debug build (make DEBUG=1) of tgtd.
> (The symptom is much easier to reproduce during a heavy I/O test and
> with an optimized build of tgtd (-O2).)
> When the symptom showed up, I got the following backtrace (only the
> last part is pasted):
> ============
> ..
> [New Thread 0x2aabbaa5d940 (LWP 20176)]
> [New Thread 0x2aabbb45e940 (LWP 20177)]
> [New Thread 0x2aabbbe5f940 (LWP 20227)]
> [New Thread 0x2aabbc860940 (LWP 20228)]
> [New Thread 0x2aabbd261940 (LWP 20229)]
> [New Thread 0x2aabbdc62940 (LWP 20230)]
> [New Thread 0x2aabbe663940 (LWP 20258)]
> [New Thread 0x2aabbf064940 (LWP 20259)]
> [New Thread 0x2aabbfa65940 (LWP 20265)]
> [New Thread 0x2aabc0466940 (LWP 20266)]
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x000000000040889d in iscsi_data_out_rx_start (conn=0x10f26028) at
> iscsi/iscsid.c:1524
> 1524                    if (task->tag == req->itt)
> (gdb) bt
> #0  0x000000000040889d in iscsi_data_out_rx_start (conn=0x10f26028) at
> iscsi/iscsid.c:1524
> #1  0x0000000000409360 in iscsi_task_rx_start (conn=0x10f26028) at
> iscsi/iscsid.c:1729
> #2  0x0000000000409d42 in iscsi_rx_handler (conn=0x10f26028) at
> iscsi/iscsid.c:1986
> #3  0x0000000000411ba6 in iscsi_tcp_event_handler (fd=445, events=5,
> data=0x10f26028) at iscsi/iscsi_tcp.c:158
> #4  0x0000000000417365 in event_loop () at tgtd.c:454
> #5  0x0000000000417a16 in main (argc=1, argv=0x7fffd5eb9a98) at tgtd.c:640
> (gdb)
>
> (gdb) print task
> $5 = (struct iscsi_task *) 0xffffffffffffff90
> (gdb) print req
> $6 = (struct iscsi_data *) 0x10f26148
> (gdb)
>
> (gdb) p task->req
> Cannot access memory at address 0xffffffffffffff90
> (gdb) p task->rsp
> Cannot access memory at address 0xffffffffffffffc0
> (gdb) p task->tag
> Cannot access memory at address 0xfffffffffffffff0
>
>
> (gdb) p req->opcode
> $30 = 5 '\005'
> (gdb) p req->flags
> $31 = 128 '\200'
> (gdb) p req->rsvd2
> $32 =   "\000"
>
> ============
>
> The system log can be downloaded from here:
> http://dl.dropbox.com/u/8354750/tgtd/20110704/messages
>
> It seems *task* is freed and then referenced again.
> I hope to get some feedback.
> Thanks a lot.
>
> --
> Kiefer Chang
>
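
A note on the quoted trace: task = 0xffffffffffffff90 is the value that
list_entry() produces from a NULL link pointer when the list member sits
0x70 bytes into the structure. The gdb addresses in the quoted session
place req at offset 0x00, rsp at 0x30 and tag at 0x60. The 0x70 offset of
the list member is an assumption, and struct demo_task below is a
hypothetical layout built to match those addresses, not a copy of the
definition in iscsid.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Hypothetical layout matching the addresses gdb printed. */
struct demo_task {
	unsigned char req[48];      /* task->req at ...ff90: offset 0x00 */
	unsigned char rsp[48];      /* task->rsp at ...ffc0: offset 0x30 */
	uint64_t tag;               /* task->tag at ...fff0: offset 0x60 */
	uint64_t conn_placeholder;  /* assumed 8-byte field at 0x68 */
	struct list_head c_hlist;   /* assumed list linkage at 0x70 */
};

/*
 * Address that list_entry() would compute for a given link pointer
 * value, done in integer arithmetic so that nothing is dereferenced.
 */
static uint64_t entry_address(uint64_t link)
{
	return link - (uint64_t)offsetof(struct demo_task, c_hlist);
}
```

With this layout, entry_address(0) is 0xffffffffffffff90, the same value
gdb printed for task. If the assumption holds, the list walk in
iscsi_data_out_rx_start followed a next pointer that was NULL, which may
point to a zeroed or reused list node rather than arbitrary garbage.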