Re: Data Digest Errors

jrepac@xxxxxxxxx · Sun, 26 Feb 2012 09:08:13 -0800 (PST)

Hi Nicholas,
I am getting fairly close to the root cause of the digest issue.  Here's what I have so far:

1. The test case launches the Windows drive format after an iSCSI session is started with data and header digests enabled.

2. The initiator is in one outstanding command per session mode to keep things simple.

3. Added logging to the target code for the receipt of immediate data and solicited Data Out PDUs.
4. Added logging of min_t results and arguments inside of iscsit_do_crypto_hash_sg.  I needed this to rule out data offsets as the problem.

Here's what I observed:
1. The problem does not appear to be related to offsets.  The added logging shows the same offsets and element lengths being hashed ahead of crypto_hash_final in passing and failing cases alike.
2. The first 10-12 seconds of the format operation are always free of data digest errors.  Once the first write fails, every subsequent write fails with a data digest error.  Each failure cycle consists of a TCP reset, a login, and a write with a digest failure.

3. Data digest errors always occur on the first Data Out PDU following R2T.
4. No digest failures occur on immediate data.
5. Around two to three Data Out PDUs beyond the failing one are received by the target before the connection is reset.  This is from the Wireshark trace taken at the target.
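
To illustrate why matching offsets and element lengths point away from the hashing loop, here is a minimal sketch in Python (not the kernel code). It assumes the digest is a running CRC32C fed one segment at a time, with each step clamped by a min() much like the min_t() call that was logged. `crc32c` and `digest_in_segments` are hypothetical names used only for this illustration.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli). Pass a previous result as crc to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def digest_in_segments(buf: bytes, seg_len: int) -> int:
    """Hash buf in pieces of at most seg_len bytes, like walking a scatterlist."""
    crc = 0
    pos = 0
    while pos < len(buf):
        cur = min(len(buf) - pos, seg_len)  # clamp, analogous to min_t()
        crc = crc32c(buf[pos:pos + cur], crc)
        pos += cur
    return crc


payload = bytes(range(256)) * 32  # 8192 bytes, same length as the failing PDU
one_shot = crc32c(payload)
for seg_len in (4096, 4000, 144, 1):
    # The split points do not change the result for a fixed byte stream.
    assert digest_in_segments(payload, seg_len) == one_shot
```

If the real loop behaves like this, two runs with identical offsets and lengths can only disagree when the bytes being hashed differ.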

Analysis and Theories:

1. I am beginning to believe this problem is related to how full the pipeline is on the initiator side.  This is based on the first 10-12 seconds being free of failures, after which the target is permanently broken.  I tried repeating this without touching the setup of the target by logging out and then back in.  The target is again fine for 10-12 seconds.
2. Immediate Data never fails data digest checking.  This is the only time when there can be only one thing outstanding on the wire.
3. The two to three DataOut PDUs following the failed PDU are highly suspect in terms of somehow contributing to the failures.

As an experiment, I am thinking of adding retries to the data digest calculation.  If retries work, we can rule out data corruption.
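
A rough sketch of the retry idea, in Python rather than the target's C code. `check_with_retries` and the `read_payload` callback are hypothetical names: the callback stands in for re-reading the receive buffer, and `expected` is the digest value the initiator sent.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli); slow but easy to audit."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def check_with_retries(read_payload, expected: int, attempts: int = 3) -> list:
    """Recompute the digest several times and record whether each try matched."""
    return [crc32c(read_payload()) == expected for _ in range(attempts)]


# Usage with a buffer that never changes: every attempt gives the same answer.
stable = bytes(32)
print(check_with_retries(lambda: stable, crc32c(stable)))
```

A retry that succeeds after an initial failure would suggest the bytes that end up in the buffer are the ones the initiator hashed, which fits ruling out corruption.  Retries that all fail the same way would leave corruption on the table.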

Thanks,
-Joe  

----- Original Message -----
From: Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx>
To: jrepac@xxxxxxxxx
Cc: target-devel <target-devel@xxxxxxxxxxxxxxx>
Sent: Thursday, February 16, 2012 11:51 PM
Subject: Re: Data Digest Errors

On Thu, 2012-02-16 at 08:37 -0800, jrepac@xxxxxxxxx wrote:
> Hi Nicholas,
> I think you may be getting lost in the packet sequence.

Mmmm, indeed.

I realize now that my wireshark install is *not* performing application
layer decoding of the R2T nor DataOUT data_sn=0 in question for ITT
0x00170000 and the failed data-out payload digest.

I'll assume this is because HW offload is enabled on the side that
generated the pcap, correct..?

> I am using the Wireshark filter iscsi.initiatortasktag == 0x00170000.
> Here's a summary:
> 
> packet 69751 - Write command + immediate data
> packet 69755 - R2T from the target
> 
> packet 69762 - DataOut PDU => PDU with mis-detected Data Digest error
> 
> packet 69768 - Next DataOut PDU
> packet 69771 - Next DataOut PDU
> 
> If you switch to filter tcp.port==3260....
> 
> packets 69722-69779, 69844 - Target TCP ACKs previously sent data 
> 
> packet 69844 - Target resets the TCP connection (Digest error detected
> fail path)
> 
> 
> Packet 69762 matches the first detected digest error
> in /var/log/messages:
> ITT: 0x00170000, Offset: 8048, Length: 8192, DataSN: 0x00000000,
> CRC32C DataDigest 0xa8517e77 does not match computed 0x4e289762
> 
> Offset 8048            =  0x1f70
> Length 8192            =  0x2000
> DataDigest 0xa8517e77  =~ 0x77 0x7e 0x51 0xa8 (From Wireshark trace
> and reverse order presentation)
> 
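
The conversions quoted above can be double-checked with a few lines of Python. This is only a sketch: `wire_bytes` is a hypothetical helper, and the little-endian packing is an assumption taken from the "reverse order presentation" noted in the trace.

```python
import struct

# Values copied from the /var/log/messages line quoted above.
logged_offset = 8048
logged_length = 8192
logged_digest = 0xA8517E77


def wire_bytes(digest: int) -> bytes:
    """Return the 4 bytes as they would appear in the capture (assumed little-endian)."""
    return struct.pack("<I", digest)


print(hex(logged_offset))                     # 0x1f70
print(hex(logged_length))                     # 0x2000
print(wire_bytes(logged_digest).hex(" "))     # 77 7e 51 a8
```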

So looking again at the raw TCP payloads (minus the wireshark iscsi
protocol decode), the 48-byte iSCSI R2T and DataOUT (data_sn=0) PDUs for
ITT 0x00170000 use offset 0x1f70, which is based on the length of the
original immediate data payload received, and AFAICT they look sane on
the wire.

> The initial R2T is definitely in the trace.  Either something is wrong
> in the CRC32C calculation code or the data has changed by the time the
> calculation was performed. I see little difference between a packet
> with offset 0x1f70 that fails and those with offset 0x1f60 that digest
> works on.  Puzzling!
> 

Having the OS FS+BLOCK layer change a WRITE payload after a data digest
payload has been generated is not completely unusual (in the Linux soft
iSCSI initiator world), but at least in my experience in Linux this
tends to happen more for immediate or unsolicited data digests than for
solicited data..

I've not seen any bug-reports attributed to this with the software MSFT
iSCSI initiator thus far, but with HW offload it could be a possibility.
It would probably be useful to try to generate a crc32c from the
wireshark packet payload in question if possible to verify the
payload..?
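
One way to do that check offline is sketched below in Python.  It is not
the kernel's code path: `crc32c` is a plain bitwise CRC32C (Castagnoli,
reflected polynomial 0x82F63B78) and `digest_matches` is a hypothetical
helper name.  It assumes the data segment has been copied out of
wireshark as a hex string, and that the four digest bytes from the trace
are read in little-endian order, as the reversed presentation in the log
line above suggests.

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli); slow but easy to audit."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def digest_matches(payload_hex: str, wire_digest_hex: str) -> bool:
    """Compare CRC32C of the payload with the digest bytes seen on the wire."""
    payload = bytes.fromhex(payload_hex)
    wire = bytes.fromhex(wire_digest_hex)
    return crc32c(payload) == int.from_bytes(wire, "little")


# Sanity check against the standard CRC32C check value before trusting it.
assert crc32c(b"123456789") == 0xE3069283

# Example with 32 zero bytes and their digest bytes in wire order.
print(digest_matches("00" * 32, "aa 36 91 8a"))
```

If the payload from packet 69762 hashes to the digest carried in the PDU,
the data was intact on the wire and changed (or was read differently) on
the target side; if it does not, the sender is the more likely suspect.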

I'm also happy to try to reproduce the same offset sequence with a
software linux-iscsi initiator to verify iscsi_target is working as
expected here.

Thanks,

--nab