Dear Mr. Tomonori!
We got read errors using the iSER (over InfiniBand) transport with tgtd (0.9.3).
I discussed this on the open-iscsi mailing list first.
After reviewing our tests I found that restarting tgtd
cures the read errors, but only for the next access to the target.
Here is what we have done:
On Initiator writing:
ares:~# lmdd if=internal of=/dev/sdc opat=1 bs=1M count=1000 mismatch=1
1000.0000 MB in 6.3606 secs, 157.2190 MB/sec
Check on Target is fine:
athene:~# lmdd of=internal if=/dev/vg0/test ipat=1 bs=1M count=1000 mismatch=1
1000.0000 MB in 0.8849 secs, 1130.0176 MB/sec
On initiator reading:
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=1000000 want=1a0000 got=1b3000
off=1000000 want=1a0004 got=1b3004
off=1000000 want=1a0008 got=1b3008
off=1000000 want=1a000c got=1b300c
off=1000000 want=1a0010 got=1b3010
off=1000000 want=1a0014 got=1b3014
off=1000000 want=1a0018 got=1b3018
off=1000000 want=1a001c got=1b301c
off=1000000 want=1a0020 got=1b3020
off=1000000 want=1a0024 got=1b3024
1.0000 MB in 0.0064 secs, 157.2822 MB/sec
But if I restart the tgt daemon on the target side, everything is OK:
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
1000.0000 MB in 22.2695 secs, 44.9045 MB/sec
But only for the first run of lmdd! After that the error strikes reproducibly
every time.
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=0 want=8ae00 got=a9e00
off=0 want=8ae04 got=a9e04
off=0 want=8ae08 got=a9e08
off=0 want=8ae0c got=a9e0c
off=0 want=8ae10 got=a9e10
off=0 want=8ae14 got=a9e14
off=0 want=8ae18 got=a9e18
off=0 want=8ae1c got=a9e1c
off=0 want=8ae20 got=a9e20
off=0 want=8ae24 got=a9e24
0.0000 MB in 0.0029 secs, 0.0000 MB/sec
ares:~# lmdd of=internal if=/dev/sdc ipat=1 bs=1M count=1000 mismatch=10
off=51000000 want=3129e00 got=3147e00
off=51000000 want=3129e04 got=3147e04
off=51000000 want=3129e08 got=3147e08
off=51000000 want=3129e0c got=3147e0c
off=51000000 want=3129e10 got=3147e10
off=51000000 want=3129e14 got=3147e14
off=51000000 want=3129e18 got=3147e18
off=51000000 want=3129e1c got=3147e1c
off=51000000 want=3129e20 got=3147e20
off=51000000 want=3129e24 got=3147e24
51.0000 MB in 0.1463 secs, 348.5702 MB/sec
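One thing that may help narrow this down is to look at how far the returned data is displaced from where it should be. The following is only a sketch and not part of lmdd: it assumes that `want` and `got` are hexadecimal byte offsets (the 4-byte steps in the output suggest this) and that the page size is 4096 bytes. The helper name `displacements` is made up for illustration.

```python
import re

# Assumption: lmdd prints "off=<n> want=<hex> got=<hex>" for each mismatching word.
MISMATCH_RE = re.compile(r"off=(\d+)\s+want=([0-9a-fA-F]+)\s+got=([0-9a-fA-F]+)")
PAGE_SIZE = 4096  # assumed page size on both machines


def displacements(lmdd_output):
    """Return the sorted, de-duplicated (got - want) differences in bytes."""
    deltas = set()
    for match in MISMATCH_RE.finditer(lmdd_output):
        want = int(match.group(2), 16)
        got = int(match.group(3), 16)
        deltas.add(got - want)
    return sorted(deltas)


if __name__ == "__main__":
    # First mismatch line of each of the three failing runs above.
    sample = """
    off=1000000 want=1a0000 got=1b3000
    off=0 want=8ae00 got=a9e00
    off=51000000 want=3129e00 got=3147e00
    """
    for delta in displacements(sample):
        pages, rest = divmod(delta, PAGE_SIZE)
        print(f"delta={delta:#x} bytes = {pages} pages + {rest} bytes")
```

For the three failing runs above the displacements are 0x13000, 0x1f000 and 0x1e000 bytes, that is 19, 31 and 30 whole 4 KiB pages. Under the stated assumptions, the reads appear to return intact data from the wrong page rather than random garbage.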
How to debug further?
* I have never seen a single write corruption. Only reading is the problem.
* After switching from the iSER transport to TCP over IPoIB there is no problem at all.
Since writing is no problem I do not think that the problem is related
to the infiniband layer or the RDMA itself. But is the problem on the
initiator or on the target side?
* I tried an experimental Debian kernel 2.6.28; the results were the same.
* I swapped the roles of initiator and target - same result.
* The amount of RAM that influenced the TioTest-runs does NOT affect the
behavior of lmdd.
* The read corruption occurs with 256 MB as well as with 32 GB of RAM.
* The number of CPUs does not matter either. I tried from one core up to 8 cores.
* BIOS of the servers is set to failsafe.
* The firmware of the Mellanox cards is the current version 1.2.0 and was left
unchanged.
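To rule out lmdd itself and to vary the read size, the same check could be repeated with an independent reader. The sketch below assumes that the pattern written with `opat=1` stores, in every 4-byte word, the byte offset of that word (which is what the `want` values suggest) and that the words are little-endian; both are assumptions, and the function names are hypothetical.

```python
import struct

WORD = 4  # the lmdd output above advances in 4-byte steps


def expected_block(base, length, byteorder="<"):
    """Build `length` bytes in which every 4-byte word holds its own byte offset."""
    offsets = range(base, base + length, WORD)
    return struct.pack(f"{byteorder}{len(offsets)}I",
                       *(o & 0xFFFFFFFF for o in offsets))


def find_mismatches(path, block_size=1 << 20, count=1000, limit=10, byteorder="<"):
    """Return up to `limit` (offset, value_found) pairs where the pattern is wrong."""
    bad = []
    base = 0
    with open(path, "rb") as dev:
        for _ in range(count):
            data = dev.read(block_size)
            usable = len(data) - len(data) % WORD  # ignore a trailing partial word
            if usable == 0:
                break
            data = data[:usable]
            # Cheap whole-block comparison first; scan word by word only on failure.
            if data != expected_block(base, usable, byteorder):
                words = struct.iter_unpack(f"{byteorder}I", data)
                for i, (value,) in enumerate(words):
                    offset = base + i * WORD
                    if value != offset & 0xFFFFFFFF:
                        bad.append((offset, value))
                        if len(bad) >= limit:
                            return bad
            base += usable
    return bad
```

A call such as `find_mismatches("/dev/sdc", block_size=64 * 1024)` on the initiator would then show whether the corruption also appears with smaller read requests. Each returned pair gives the offset that was read and the offset the data seems to come from.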
Maybe I used the wrong versions of the software packages:
I used:
Debian Lenny packages:
- open-iscsi 2.0.870~rc3-0.4
- libibverbs1 1.1.2-1
- librdmacm1 1.0.7-1
From OFED-1.3, self-compiled:
- libibcommon 1.1.1-1
- libibumad 1.2.1-1
- opensm 3.2.2
STGT, self-compiled:
- tgtd 0.9.3
against the Debian -dev packages:
- libibverbs-dev 1.1.2-1
- librdmacm-dev 1.0.7-1
Any help is welcome.
Best regards
Volker
--
====================================================
inqbus it-consulting +49 ( 341 ) 5643800
Dr. Volker Jaenisch http://www.inqbus.de
Herloßsohnstr. 12 0 4 1 5 5 Leipzig
N O T - F Ä L L E +49 ( 170 ) 3113748
====================================================