Re: help tgt segfault

FUJITA Tomonori <fujita.tomonori@xxxxxxxxxxxxx> · Sat, 20 Dec 2008 00:51:51 +0900

On Fri, 19 Dec 2008 16:07:51 +0100
Tomasz Chmielewski <mangoo@xxxxxxxx> wrote:

> Tomasz Chmielewski wrote:
> > FUJITA Tomonori wrote:
> >> On Wed, 17 Dec 2008 18:51:41 +0100
> >> Tomasz Chmielewski <mangoo@xxxxxxxx> wrote:
> >>
> >>> FUJITA Tomonori wrote:
> >>>
> >>>> Can you try one more time with this patch (including the previous
> >>>> patch so please do git-reset --hard first).
> >>> Here you are:
> >>>
> >>> Dec 17 18:46:27 megathecus tgtd: Target daemon logger with pid=21048 started!
> >>
> >> Thanks a lot! This is very useful.
> >>
> >> Can you try this again? Even if tgtd doesn't crash, please send the
> >> log.
> > 
> > I couldn't crash it any more.
> > 
> > Here is a lengthy log for 25 minutes - with several device 
> > suspends/resumes.
> > I'll give it 2 more hours testing...
> 
> Not 2 hours, but some observations - is it possible that tgtd serves 
> wrong data to the initiator when the access to the media is slow?

tgtd doesn't return wrong (bogus) data.

The initiator has to give up on I/O requests at some point if the target
doesn't send responses for a long time. That is when you see the I/O
errors.

Or, on the target side, the kernel gives tgtd I/O errors if the backing
store doesn't return responses for a long time. tgtd then sends those
errors to the initiator.

Try scsi_debug if you want to see I/O errors due to timeout:

vine:~# modprobe scsi_debug opts=4
vine:~# cd /sys/module/scsi_debug/parameters
vine:/sys/module/scsi_debug/parameters# echo 1 > every_nth

Then try to read the scsi_debug device. In my case:

vine:~# lsscsi
[0:0:0:0]    disk    IBM      1814      FAStT  2916  /dev/sda
[0:0:0:1]    disk    IBM      1814      FAStT  2916  /dev/sdb
[0:0:0:2]    disk    IBM      1814      FAStT  2916  /dev/sdc
[0:0:0:31]   disk    IBM      Universal Xport  2916  -
[1:0:0:0]    disk    Linux    scsi_debug       0004  /dev/sdd

vine:~# dd if=/dev/sdd of=/dev/null count=1

If you wait for some time, then you see I/O errors like:

dd: reading `/dev/sdd': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 90.2521 seconds, 0.0 kB/s

And you see something like this in the kernel log:

sd 1:0:0:0: Device offlined - not ready after error recovery
sd 1:0:0:0: [sdd] Unhandled error code
sd 1:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
Buffer I/O error on device sdd, logical block 1
Buffer I/O error on device sdd, logical block 2
Buffer I/O error on device sdd, logical block 3
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0

You can replace dd with tgtd in the above example.
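
As a rough sketch of that setup (not something run in this thread), the
scsi_debug disk could be exported through tgtd roughly as follows. It
assumes the tgtadm syntax from the tgt README; the IQN, the tid, and the
TGTADM override are placeholders made up for illustration:

```shell
# Sketch: export a block device through tgtd so that an initiator hits
# the same timeouts that dd hit locally.
# The default IQN and tid are placeholders, not values from this thread.
# TGTADM is a hypothetical override: TGTADM=echo previews the commands.
export_lun() {
    dev=$1
    iqn=${2:-iqn.2008-12.com.example:scsi-debug}
    tid=${3:-1}
    t=${TGTADM:-tgtadm}
    # Create the target, attach the device as LUN 1, accept any initiator.
    $t --lld iscsi --op new --mode target --tid "$tid" -T "$iqn" &&
    $t --lld iscsi --op new --mode logicalunit --tid "$tid" --lun 1 -b "$dev" &&
    $t --lld iscsi --op bind --mode target --tid "$tid" -I ALL
}
```

Running "TGTADM=echo export_lun /dev/sdd" only prints the three tgtadm
commands; without the override they are executed against a running tgtd.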

> During the suspend/resume cycle on the target and dd if=/dev/iscsi_disk 
> on the initiator, I got some errors:
> 
> # dd if=/dev/sdb of=/dev/null
> dd: reading `/dev/sdb': Input/output error
> 1506128+0 records in
> 1506128+0 records out
> 771137536 bytes (771 MB) copied, 555.773 seconds, 1.4 MB/s
> 
> # dmesg
> <lots of connection errors / host resets>
>   connection1:0: iscsi: detected conn error (1011)
> iscsi: host reset succeeded
>   connection2:0: iscsi: detected conn error (1011)
> iscsi: host reset succeeded
> sd 3:0:0:1: [sdb] Result: hostbyte=0x02 driverbyte=0x00
> end_request: I/O error, dev sdb, sector 1506384
> Buffer I/O error on device sdb, logical block 188298
> Buffer I/O error on device sdb, logical block 188299
> Buffer I/O error on device sdb, logical block 188300
> Buffer I/O error on device sdb, logical block 188301
> Buffer I/O error on device sdb, logical block 188302
> Buffer I/O error on device sdb, logical block 188303
> Buffer I/O error on device sdb, logical block 188304
> Buffer I/O error on device sdb, logical block 188305
> Buffer I/O error on device sdb, logical block 188306
> Buffer I/O error on device sdb, logical block 188307
> sd 3:0:0:1: [sdb] Result: hostbyte=0x02 driverbyte=0x00
> end_request: I/O error, dev sdb, sector 1506136
> sd 3:0:0:1: [sdb] Result: hostbyte=0x02 driverbyte=0x00
> end_request: I/O error, dev sdb, sector 1506128
>   connection2:0: iscsi: detected conn error (1011)
> iscsi: host reset succeeded
> 
> 
> 
> I had exactly the same issue (I/O errors) when tgtd:
> - is accessing a DRBD device (a device replicated over network)
> - DRBD device is being synchronized (over a slow, 2 Mbit internet link), 
> using the whole bandwidth
> - any writes (needing replication over network) done by the initiator 
> will be very slow, because of the ongoing sync and slow link

You need a longer timeout on the initiator side, the target side, or both.
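
For the SCSI layer, one possible way to do this is the per-device
timeout attribute in sysfs (in seconds). The helper below is only a
sketch: the function name and the SYSFS_ROOT override are made up for
illustration, and a suitable value depends on how slow the backing
store can get. The iSCSI initiator has its own timeouts as well (for
example node.session.timeo.replacement_timeout in open-iscsi), which
may also need raising.

```shell
# Sketch: raise the SCSI command timeout (in seconds) for one disk.
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a scratch directory; it defaults to the real /sys.
set_scsi_timeout() {
    dev=$1
    secs=$2
    f="${SYSFS_ROOT:-/sys}/block/$dev/device/timeout"
    if [ ! -w "$f" ]; then
        echo "cannot write $f" >&2
        return 1
    fi
    echo "$secs" > "$f"
    # Print the value read back so the caller can confirm it took effect.
    cat "$f"
}
```

Run as root, "set_scsi_timeout sdb 120" on the initiator would apply to
the iSCSI disk, and the same call on the target would apply to the
backing device. The setting does not survive a reboot or a device
re-scan.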