Re: help tgt segfault

FUJITA Tomonori <fujita.tomonori@xxxxxxxxxxxxx> · Tue, 16 Dec 2008 20:19:16 +0900

On Mon, 15 Dec 2008 10:48:05 +0100
Tomasz Chmielewski <mangoo@xxxxxxxx> wrote:

> FUJITA Tomonori wrote:
> > On Thu, 11 Dec 2008 07:58:30 -0800
> > "Jesse Nelson" <spheromak@xxxxxxxxx> wrote:
> > 
> >> We're running a vanilla 2.6.27.4 kernel with tgt 0.9.2, with about 30
> >> targets and about 10mb/s throughput.
> >> I am constantly (daily) seeing tgtd segfault. No real deep info, just
> >> this error in the logs:
> >>     segfault at 8 ip 000000000040ebed sp 00007fffb259cb30 error 6 in tgtd[400000+23000]
> >> Any ideas or suggestions on how I can dig deeper here?
> > 
> > Can you run gdb with tgtd?
> > 
> > If you can't, can you give very detailed information about what
> > you are doing, so that I can do the same thing and reproduce the
> > problem?
> 
> I'm seeing those occasionally too (one tgtd process dies), but *very* rarely.
> 
> It doesn't seem to depend on load type, number of connected/working initiators,
> configured targets etc., and I'm not sure how to reproduce it.
> 
> One thing that comes to mind is that one tgtd process dies when an initiator wants
> to read data and tgtd can't "deliver" it immediately (i.e., I/O is "frozen" because of
> SATA resets/exceptions/timeouts). It doesn't always happen on such SATA timeouts and
> is therefore hard to reproduce.

TMF (task management function, i.e. an initiator trying to abort a
request after a timeout) might be related to your problem. I'll dig
into it this weekend.

> Look at this log - tgtd segfaulted just after SATA timeouts (after ~50 days of working properly).
> This happened with the tgtd version fetched on 2008-Oct-24, running on x86,
> with just two initiators connected; load on one target was perhaps about 5 MB/s,
> and on the second target it was close to 0 MB/s.
> 
> Dec 11 21:57:37 megathecus kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Dec 11 21:57:37 megathecus kernel: ata4.00: cmd 25/00:00:bf:78:1f/00:02:14:00:00/e0 tag 0 dma 262144 in
> Dec 11 21:57:37 megathecus kernel:          res 40/00:01:01:4f:c2/40:00:15:00:00/00 Emask 0x4 (timeout)
> Dec 11 21:57:37 megathecus kernel: ata4.00: status: { DRDY }
> Dec 11 21:57:37 megathecus kernel: ata4: soft resetting link
> Dec 11 21:57:37 megathecus kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Dec 11 21:57:37 megathecus kernel: ata4.00: configured for UDMA/133
> Dec 11 21:57:37 megathecus kernel: ata4: EH complete
> Dec 11 21:58:07 megathecus kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> Dec 11 21:58:07 megathecus kernel: ata4.00: cmd 25/00:00:bf:78:1f/00:02:14:00:00/e0 tag 0 dma 262144 in
> Dec 11 21:58:07 megathecus kernel:          res 40/00:01:01:4f:c2/40:00:15:00:00/00 Emask 0x4 (timeout)
> Dec 11 21:58:07 megathecus kernel: ata4.00: status: { DRDY }
> Dec 11 21:58:07 megathecus kernel: ata4: soft resetting link
> Dec 11 21:58:07 megathecus kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> Dec 11 21:58:07 megathecus kernel: ata4.00: configured for UDMA/133
> Dec 11 21:58:07 megathecus kernel: ata4: EH complete
> Dec 11 21:58:08 megathecus kernel: tgtd[2567]: segfault at 00000220 eip 0804f0b5 esp 77abdac0 error 4
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] 781422768 512-byte hardware sectors (400088 MB)
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] Write Protect is off
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] 781422768 512-byte hardware sectors (400088 MB)
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] Write Protect is off
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] Mode Sense: 00 3a 00 00
> Dec 11 21:58:08 megathecus kernel: sd 4:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> 
> 
> I reported a similar issue in June 2008 - see the thread titled
> "disk kicked out of RAID -> tgtd segmentation fault":
> 
> http://lists.wpkg.org/pipermail/stgt/2008-June/thread.html#1702
> http://lists.wpkg.org/pipermail/stgt/2008-July/thread.html#1746
> 
> Can it be related somehow?

I thought that I fixed the bug in the above thread.
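
For anyone who wants to get a bit more out of the two "segfault at ..."
lines quoted above before a gdb session is possible, here is a rough
sketch of how they can be decoded. It only relies on the x86 page-fault
error code bits (bit 0: protection violation vs. page not present,
bit 1: write vs. read, bit 2: user vs. kernel mode). The function name
decode_segfault and the "fault address below one page suggests a NULL
pointer" check are my own assumptions for illustration, not anything
tgt provides.

```python
import re

# Matches both kernel message formats quoted in this thread, e.g.
#   segfault at 8 ip 000000000040ebed sp 00007fffb259cb30 error 6 in tgtd[400000+23000]
#   segfault at 00000220 eip 0804f0b5 esp 77abdac0 error 4
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) e?ip (?P<ip>[0-9a-f]+) "
    r"e?sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    ip = int(m.group("ip"), 16)
    err = int(m.group("err"), 16)  # the kernel prints the error code in hex
    info = {
        "fault_addr": addr,
        "ip": ip,
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        # Heuristic (assumption): an address inside the first page is
        # usually a struct member accessed through a NULL pointer.
        "likely_null_deref": addr < 4096,
    }
    if m.group("base"):
        # Offset of the faulting instruction inside the mapped object.
        info["object"] = m.group("obj")
        info["ip_offset"] = ip - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    for line in (
        "segfault at 8 ip 000000000040ebed sp 00007fffb259cb30 error 6 in tgtd[400000+23000]",
        "tgtd[2567]: segfault at 00000220 eip 0804f0b5 esp 77abdac0 error 4",
    ):
        print(decode_segfault(line))
```

Read this way, the first report is a user-mode write to address 0x8 and
the second a user-mode read from address 0x220, both on a page that is
not present. Addresses that small are consistent with a NULL pointer
dereference, though that is only a guess until there is a backtrace.
If the exact binary that crashed is still around, was built with debug
info, and is not position-independent, something like
"addr2line -f -e tgtd 0x40ebed" may already name the function.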