On Fri, 19 Dec 2008 16:07:51 +0100
Tomasz Chmielewski <mangoo@xxxxxxxx> wrote:

> Tomasz Chmielewski wrote:
> > FUJITA Tomonori wrote:
> >> On Wed, 17 Dec 2008 18:51:41 +0100
> >> Tomasz Chmielewski <mangoo@xxxxxxxx> wrote:
> >>
> >>> FUJITA Tomonori wrote:
> >>>
> >>>> Can you try one more time with this patch (including the previous
> >>>> patch so please do git-reset --hard first).
> >>> Here you are:
> >>>
> >>> Dec 17 18:46:27 megathecus tgtd: Target daemon logger with pid=21048
> >>> started!
> >>
> >> Thanks a lot! This is very useful.
> >>
> >> Can you try this again? Even if tgtd doesn't crash, please send the
> >> log.
> >
> > I couldn't crash it any more.
> >
> > Here is a lengthy log for 25 minutes - with several device
> > suspends/resumes.
> > I'll give it 2 more hours testing...
>
> Not 2 hours, but some observations - is it possible that tgtd serves
> wrong data to the initiator when the access to the media is slow?

tgtd doesn't return wrong (bogus) data.

The initiator has to give up on I/O requests at some point if the target
doesn't send responses for a long time. Then you see the I/O errors.

Or, on the target side, the kernel gives tgtd I/O errors if the backing
store doesn't return responses for a long time. Then tgtd sends the
errors to the initiator.

Try scsi_debug if you want to see I/O errors due to timeout:

vine:~# modprobe scsi_debug opts=4
vine:~# cd /sys/module/scsi_debug/parameters
vine:/sys/module/scsi_debug/parameters# echo 1 > every_nth

Then try to read the scsi_debug device.
In my case:

vine:~# lsscsi
[0:0:0:0]    disk    IBM      1814      FAStT  2916  /dev/sda
[0:0:0:1]    disk    IBM      1814      FAStT  2916  /dev/sdb
[0:0:0:2]    disk    IBM      1814      FAStT  2916  /dev/sdc
[0:0:0:31]   disk    IBM      Universal Xport  2916  -
[1:0:0:0]    disk    Linux    scsi_debug       0004  /dev/sdd
vine:~# dd if=/dev/sdd of=/dev/null count=1

If you wait for some time, then you see I/O errors like:

dd: reading `/dev/sdd': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 90.2521 seconds, 0.0 kB/s

And you see something like this in the kernel log:

sd 1:0:0:0: Device offlined - not ready after error recovery
sd 1:0:0:0: [sdd] Unhandled error code<6>sd 1:0:0:0: [sdd] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT,SUGGEST_OK
end_request: I/O error, dev sdd, sector 0
Buffer I/O error on device sdd, logical block 0
Buffer I/O error on device sdd, logical block 1
Buffer I/O error on device sdd, logical block 2
Buffer I/O error on device sdd, logical block 3
sd 1:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sdd, logical block 0

You can replace dd with tgtd in the above example.
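
To check what the kernel log above reports ("Device offlined"), the device
state and the per-command timeout can be read from sysfs. This is only a
minimal sketch: it assumes the usual `state` and `timeout` attributes under
/sys/block/<dev>/device, and the function name `show_scsi_state` and its
optional sysfs-root argument are made up for illustration.

```shell
# show_scsi_state DEV [SYSFS_ROOT]
# Print the SCSI device state and command timeout for DEV (e.g. sdd).
# SYSFS_ROOT defaults to /sys; it is a parameter only so the function
# can be tried against a scratch directory (an assumption of this sketch).
show_scsi_state() {
    dev=$1
    root=${2:-/sys}
    d="$root/block/$dev/device"

    # Bail out if the device has no SCSI state attribute.
    [ -r "$d/state" ] || { echo "no such SCSI device: $dev" >&2; return 1; }

    printf '%s: state=%s timeout=%ss\n' \
        "$dev" "$(cat "$d/state")" "$(cat "$d/timeout")"
}
```

After a failure like the one above, `show_scsi_state sdd` would be expected
to report the device as offline. Writing `running` to the same `state`
attribute is the usual way to bring an offlined device back.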
> During the suspend/resume cycle on the target and dd if=/dev/iscsi_disk
> on the initiator, I got some errors:
>
> # dd if=/dev/sdb of=/dev/null
> dd: reading `/dev/sdb': Input/output error
> 1506128+0 records in
> 1506128+0 records out
> 771137536 bytes (771 MB) copied, 555.773 seconds, 1.4 MB/s
>
> # dmesg
> <lots of connection errors / host resets>
> connection1:0: iscsi: detected conn error (1011)
> iscsi: host reset succeeded
> connection2:0: iscsi: detected conn error (1011)
> iscsi: host reset succeeded
> sd 3:0:0:1: [sdb] Result: hostbyte=0x02 driverbyte=0x00
> end_request: I/O error, dev sdb, sector 1506384
> Buffer I/O error on device sdb, logical block 188298
> Buffer I/O error on device sdb, logical block 188299
> Buffer I/O error on device sdb, logical block 188300
> Buffer I/O error on device sdb, logical block 188301
> Buffer I/O error on device sdb, logical block 188302
> Buffer I/O error on device sdb, logical block 188303
> Buffer I/O error on device sdb, logical block 188304
> Buffer I/O error on device sdb, logical block 188305
> Buffer I/O error on device sdb, logical block 188306
> Buffer I/O error on device sdb, logical block 188307
> sd 3:0:0:1: [sdb] Result: hostbyte=0x02 driverbyte=0x00
> end_request: I/O error, dev sdb, sector 1506136
> sd 3:0:0:1: [sdb] Result: hostbyte=0x02 driverbyte=0x00
> end_request: I/O error, dev sdb, sector 1506128
> connection2:0: iscsi: detected conn error (1011)
> iscsi: host reset succeeded
>
>
>
> I had the exactly same issue (I/O errors) when tgtd:
> - is accessing a DRBD device (a device replicated over network)
> - DRBD device is being synchronized (over a slow, 2 Mbit internet link),
> using the whole bandwidth
> - any writes (needing replication over network) done by the initiator
> will be very slow, because of the ongoing sync and slow link

You need a longer timeout on the initiator side or the target side, or
both.
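
As a rough illustration of raising a timeout, the SCSI command timeout of a
disk can be changed through sysfs. This is a sketch under assumptions, not a
tested recipe for this setup: it relies on the standard
/sys/block/<dev>/device/timeout attribute (in seconds), and the function
name `set_scsi_timeout` and its optional sysfs-root argument are invented
here. It applies to whichever machine owns the disk, i.e. the iSCSI disk on
the initiator or the backing store on the target.

```shell
# set_scsi_timeout DEV SECONDS [SYSFS_ROOT]
# Raise (or lower) the SCSI command timeout for DEV, e.g. sdb.
# SYSFS_ROOT defaults to /sys; it exists only so the function can be
# exercised on a scratch directory (an assumption of this sketch).
set_scsi_timeout() {
    dev=$1
    secs=$2
    root=${3:-/sys}
    f="$root/block/$dev/device/timeout"

    # Accept only a whole number of seconds.
    case $secs in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    # Needs root on a real system.
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }

    old=$(cat "$f")
    echo "$secs" > "$f"
    printf '%s: timeout %ss -> %ss\n' "$dev" "$old" "$(cat "$f")"
}
```

For example, `set_scsi_timeout sdb 120` as root. The value does not survive
a reboot or a re-login of the session. If the initiator is open-iscsi (an
assumption based on the log format above), its
node.session.timeo.replacement_timeout setting in iscsid.conf also controls
how long I/O is held during a connection error before it is failed.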