Re: Possible bug in open-iSCSI

Konrad Rzeszutek <konrad@xxxxxxxxxxxxxxx> · Tue, 10 Mar 2009 10:53:05 -0400

On Tue, Mar 10, 2009 at 12:34:55PM +0530, sushrut shirole wrote:
> 
> Hi All,

Hey Sushrut,

I am also cross-posting my response to the linux-scsi mailing list
in case they have insight into this problem.

> 	I am currently guiding a few students who are working on the unh-iSCSI
> target. We are simulating faults on the target side: we are adding an
> error injection module to unh-iSCSI so that one can test how an
> initiator behaves on a particular error.
> 	As part of this we inject a fault in the reported LUN size, where we
> report a wrong LUN size (suppose a LUN is 2 GB; we report it as 4 GB).
> We are using the Microsoft and open-iSCSI initiators. When we try
> formatting this LUN on the open-iSCSI initiator, it formats the LUN.
> In fact it doesn't give any error when we try to read or lseek 4 GB
> of data. But on the Microsoft initiator we get an error when we try to
> format this LUN. So is this a bug in open-iSCSI, or is it a bug in
> read/lseek?

Open-iSCSI does not inspect any SCSI commands (except the AEN, which
gets its own special iSCSI PDU header).

The potentially faulty part is therefore the SCSI mid-layer, the block-device
layer, or the target not reporting an error.
When you lseek to a location past 2GB and do a read, the Linux kernel
translates the request into a SCSI READ command.

That SCSI READ command (you can see what the fields look like when you
capture it under ethereal) specifies which sector it wants. Open-iSCSI
wraps that SCSI command in its own header and puts it in a TCP packet
destined for the target. The target should then report a failure
(sending a SCSI SENSE value reporting a problem). Now it might be that the SCSI
mid-layer doesn't understand that error condition and passes it on as OK.
Or it might be that the target doesn't report a failure and returns garbage/null data.
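
To make the request easier to spot in a capture, here is a rough Python sketch
of how a byte offset ends up as the LBA field of a READ(10) CDB. It assumes
512-byte logical blocks and that the kernel issues READ(10) rather than
READ(16); the function name read10_cdb is made up for illustration.

```python
import struct

SECTOR_SIZE = 512  # assumption: the LUN uses 512-byte logical blocks


def read10_cdb(byte_offset, byte_count, sector_size=SECTOR_SIZE):
    """Build the 10-byte READ(10) CDB for a sector-aligned byte range."""
    if byte_offset % sector_size or byte_count % sector_size:
        raise ValueError("offset and count must be sector-aligned")
    lba = byte_offset // sector_size
    nblocks = byte_count // sector_size
    # Layout: opcode 0x28, flags, 32-bit big-endian LBA, group number,
    # 16-bit big-endian transfer length (in blocks), control byte.
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, nblocks, 0)


# A 4 KiB read at byte offset 3 GiB, i.e. past the end of a real 2GB LUN.
print(read10_cdb(3 * 1024**3, 4096).hex())
```

A 2GB LUN with 512-byte blocks ends at LBA 4194303 (0x3fffff), so any READ in
the capture with a larger LBA is one the target ought to reject.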

What I would suggest is to do a comparison. Create a test setup where you
have a real 4GB LUN, do an lseek/read above 2GB and capture all of that
traffic using wireshark/ethereal. Then do the same test but with a 2GB LUN
that is reported as 4GB and see what the traffic looks like.
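
The lseek/read step of that test could look roughly like the Python sketch
below. The helper name read_at and the 3 GiB offset are only illustrative,
and the device path in the comment is a placeholder for whatever node the
iSCSI LUN shows up as on your initiator.

```python
import os


def read_at(path, offset, length=512):
    """lseek to `offset` in `path` and read up to `length` bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


# Example (run while the capture is active; /dev/sdX is a placeholder):
#   data = read_at("/dev/sdX", 3 * 1024**3)
#   print(len(data), data[:16].hex())
```

If the read fails, os.read raises OSError (for example EIO), which is what you
would expect to see once the target's error makes it back up the stack.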

If it looks the same then somehow the target isn't reporting the right
error. That implies that when Microsoft formats a disk it verifies it - by
rereading the data it wrote and failing if that doesn't match. That might
not be what mkfs.ext3 does under Linux - look in the man page to find out. But
if you use lseek/read (or just dd with the skip argument - look in the man page
for more details) a couple of times on the same sector, you should see
different data as well.
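
A minimal sketch of that repeated read, again in Python: it reads the same
offset several times and returns a digest per read so the results can be
compared. The name sector_digests is hypothetical. Note that buffered reads
can be served from the page cache, so the sketch asks the kernel to drop
cached pages between reads with posix_fadvise; that is only a hint, and
using dd with iflag=direct is another way to bypass the cache.

```python
import hashlib
import os


def sector_digests(path, offset, length=512, repeats=3):
    """Read the same range `repeats` times; return one SHA-1 hex digest per read."""
    digests = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(repeats):
            # Hint that cached pages should be dropped so the next read
            # goes back to the device (best effort, may be ignored).
            if hasattr(os, "posix_fadvise"):
                try:
                    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass
            os.lseek(fd, offset, os.SEEK_SET)
            digests.append(hashlib.sha1(os.read(fd, length)).hexdigest())
    finally:
        os.close(fd)
    return digests


# Example (/dev/sdX is a placeholder for the iSCSI LUN):
#   d = sector_digests("/dev/sdX", 3 * 1024**3)
#   print("stable" if len(set(d)) == 1 else "data changes between reads")
```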

If the TCP dump looks different, and the target reports an error and the Linux kernel
doesn't do anything, then it is time to dig through the code (scsi_error.c) to find out
why Linux doesn't see it as an error. Make sure you use the latest kernel though - which as of
today is 2.6.29-rc7-git3. And if you do find the problem, post a patch
on the linux-scsi mailing list.

> 
> --
> Thanks,

Hope this lengthy explanation helps in your endeavor.