Hello
I am on kernel 2.6.36.2 (but I think I also saw this problem a long time
ago, on 2.6.31).
I am having a problem of greatly reduced throughput to SCSI devices:
commands run for about 0.3 seconds and then hang for about 1.7 seconds,
then run again for 0.3 seconds, then hang again for 1.7 seconds, and so
on.
This is very recognizable by looking at iostat -x 1.
During the 1.7 seconds of freeze (with iostat you actually see 1 second
of freeze) you can see a few outstanding writes. I have seen either
outstanding reads+writes or outstanding writes alone, but I think I have
never seen outstanding reads alone. For these writes, completion is
returned only after the pause (or so says blktrace...). I am even
wondering if this could be an interrupt being serviced late...?
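If it helps, one cheap way to check the late-interrupt theory is to watch
whether the HBA's interrupt counter keeps advancing during a stall. This
is only a rough sketch; the string to grep for depends on the driver
(e.g. 3w-9xxx for the 3ware, something else for the IB stack in the iSCSI
case):
--------------------------------------------------
# print a timestamp plus the storage controller's row of /proc/interrupts
# twice a second; if the counter stops increasing during the 1.7s stall
# while a write is still outstanding, a late/lost interrupt becomes a
# plausible suspect
while true; do
    date +%H:%M:%S.%N
    grep -i '3w' /proc/interrupts
    sleep 0.5
done
--------------------------------------------------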
I have seen this on two very different occasions: on a 3ware controller
with real disks, and now on iSCSI against a ramdisk.
The first case was with my 3ware 9650SE controller used as a plain SATA
controller with MD on top of it: the problem happens with MD RAID 5 when
doing e.g. 4k O_DIRECT I/O on it. You can read a detailed description in
the linux-raid mailing list:
http://marc.info/?l=linux-raid&m=129377016700391&w=2
http://marc.info/?l=linux-raid&m=129377380202417&w=2
http://marc.info/?l=linux-raid&m=129422832802809&w=2
The last post is the most relevant; start from that one.
Initially I thought it was a bug in the 3ware controller, which is famous
for not being fast / not capable of handling high IOPS. However, it is
strange that the problem does not happen on RAID 10, which pushes much
higher throughput through the same controller and the same disks. The two
situations are comparable because RAID 5 in this case does not read (see
the last link), yet the problem definitely happens only on RAID 5 and
never on RAID 10; I have tried lots of times.
Now I am seeing the same thing with open-iscsi over IPoIB against SCST
backed by a ramdisk on the other side, so it's the network, a completely
different beast, with no disks involved (it's a Linux ramdisk), on
completely different hardware for both the client and the server machine.
The only thing these two iSCSI machines have in common with the storage
server (the one with the 3ware) is the type of CPU: dual Xeon 5420. The
mainboards are also different: Supermicro for both machines in this case,
Tyan for the storage server.
Here is an excerpt of iostat -x 1 for the iSCSI case, initiator side:
-----------------------------------------------------------------
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.00   0.00     0.00   100.00    0.00   0.00

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00      0.00     0.00     0.00      0.00       0.00      0.00    184.00   0.00   0.00 100.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.59   0.00    10.95    67.40    0.00  21.05

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc       30.00  26676.00   934.00  5773.00  18912.00  258200.00     41.32    153.98  51.05   0.15  98.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.00   0.00     0.15    83.89    0.00  15.96

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00      0.00     0.00     0.00      0.00       0.00      0.00    172.00   0.00   0.00 100.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.78   0.00    16.96    51.33    0.00  30.93

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc       62.00  22410.00  1773.00  6236.00  35696.00  235392.00     33.85    156.14  42.12   0.12 100.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.00   0.00     0.00    80.00    0.00  20.00

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00      0.00     0.00     0.00      0.00       0.00      0.00    138.00   0.00   0.00 100.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.19   0.00    15.03    61.41    0.00  23.38

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc       19.00  21617.00   806.00  5303.00  13464.00  221096.00     38.40    136.32  44.53   0.16  99.00

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.00   0.00     0.00    16.96    0.00  83.04

Device:  rrqm/s    wrqm/s      r/s      w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00      0.00     0.00     0.00      0.00       0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00      0.00     0.00     0.00      0.00       0.00      0.00    152.00   0.00   0.00 100.00
-----------------------------------------------------------------
...off-on-off-on....
See what I mean?
"Top" says that all cpus are idle in the 1 second "off" period,
synchronously for both at initiator side and target side. I can also say
that the IPoIB network is practically stuck during transfer times, pings
between the two machines do not pass, but it's totally idle in the still
time, pings pass. At first I thought the network could have caused these
problems but TCP retransmissions do not wait exactly 1 second, it's a
too perfect synchronization to be a TCP problem. I have never seen a TCP
transfer go fullspeed for 1 second then 0 bytes/sec for another second
then fullspeed again for 1 second... it's not how TCP works.
Here is an iostat -x 1 for the real disks on the 3ware controller, taken
during:
dd if=/dev/zero of=/dev/mapper/datavg1-lvol0 bs=4K oflag=direct
(it is the same with oflag=direct,sync)
Note: I am keeping queue_depth at 1 on these controllers: they do not
behave well with NCQ.
Note 2: the I/O does not go through a filesystem.
-------------------------------------------------
sdi       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdj       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdl       0.00   0.00   0.00     0.00   0.00      0.00   0.00   1.00   0.00   0.00   100.00  (*)
sdm       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdk       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdn       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
...[CUT]...
sdj       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdl       0.00   0.00   0.00   448.00   0.00   3584.00   8.00   0.99   4.42   2.21    99.00
sdm       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdk       0.00   0.00   0.00   448.00   0.00   3584.00   8.00   0.03   0.07   0.07     3.00
sdn       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
...[CUT]...
sdj       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdl       0.00   0.00   0.00     0.00   0.00      0.00   0.00   1.00   0.00   0.00   100.00  (*)
sdm       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdk       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdn       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
...[CUT]...
sdj       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdl       0.00   0.00   0.00   384.00   0.00   3072.00   8.00   0.98   5.13   2.55    98.00
sdm       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdk       0.00   0.00   0.00   384.00   0.00   3072.00   8.00   0.00   0.00   0.00     0.00
sdn       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
...[CUT]...
sdj       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdl       0.00   0.00   0.00     0.00   0.00      0.00   0.00   1.00   0.00   0.00   100.00  (*)
sdm       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdk       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
sdn       0.00   0.00   0.00     0.00   0.00      0.00   0.00   0.00   0.00   0.00     0.00
-------------------------------------------------
(*) <--- see the write stuck in these places... the dd write is
synchronous, so nothing else happens until it returns, and it does not
return for one whole second (actually 1.7 seconds).
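For reference, here is roughly how I set things up and watch the pattern.
This is only a sketch, not my exact command lines, and the device names
(sdl, sdk) are just examples taken from the iostat above; the same is done
for every disk behind the controller:
--------------------------------------------------
# queue_depth pinned to 1 on the 3ware-exported disks (they misbehave with NCQ)
echo 1 > /sys/block/sdl/device/queue_depth
echo 1 > /sys/block/sdk/device/queue_depth

# 4k synchronous O_DIRECT writes straight to the LV on top of MD RAID 5,
# no filesystem involved
dd if=/dev/zero of=/dev/mapper/datavg1-lvol0 bs=4K oflag=direct

# in another terminal: during the stall, w/s drops to 0 while avgqu-sz
# stays above 0 and %util sits at 100
iostat -x 1
--------------------------------------------------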
Now I can show you a blktrace for the real-disks-on-3ware case.
Note: this is where the 0.3 / 1.7 second figures come from. For the iSCSI
case I am not sure of the exact timing, but it also cycles roughly every
2 seconds. The iSCSI case drifts slowly, so after a few minutes it is no
longer in sync with iostat; then after a few more minutes it is in sync
with iostat -x 1 again, and I could take some clean snapshots of the
problem like the iostat output I pasted above.
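In case someone wants to reproduce the trace, this is roughly how I
capture it (a sketch; /dev/sdX stands for the member disk below the MD
array):
--------------------------------------------------
# capture block-layer events on the member disk while the dd above runs,
# decoding them on the fly with blkparse
blktrace -d /dev/sdX -o - | blkparse -i -
--------------------------------------------------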
--------------------------------------------------
8,240 2 1119 0.097304762 5319 I W [kworker/u:1]
8,240 2 1120 0.097305511 5319 C W 0 [0]
8,240 1 3333 0.097350168 1557 A W 859034104 + 8 <-
(8,241) 859032056
8,240 1 3334 0.097350336 1557 Q W 859034104 + 8 [md11_raid5]
8,240 1 3335 0.097351193 1557 G W 859034104 + 8 [md11_raid5]
8,240 1 3336 0.097352168 1557 P N [md11_raid5]
8,240 1 3337 0.097352469 1557 I W 859034104 + 8 [md11_raid5]
8,240 1 3338 0.097359991 1557 U N [md11_raid5] 1
8,240 1 3339 0.097360743 1557 D W 859034104 + 8 [md11_raid5]
8,240 1 3340 0.097363489 1557 U N [md11_raid5] 1
8,240 1 3341 0.097464999 0 C W 859034104 + 8 [0]
8,240 1 3342 0.097494707 1557 U N [md11_raid5] 0
8,240 0 1116 0.097508456 5319 Q WBS [kworker/u:1]
8,240 0 1117 0.097509015 5319 G WBS [kworker/u:1]
8,240 0 1118 0.097509674 5319 P N [kworker/u:1]
8,240 0 1119 0.097509992 5319 I W [kworker/u:1]
8,240 0 1120 0.097510835 5319 C W 0 [0]
8,240 1 3343 0.097555058 1557 A W 859034112 + 8 <-
(8,241) 859032064
8,240 1 3344 0.097555236 1557 Q W 859034112 + 8 [md11_raid5]
8,240 1 3345 0.097556090 1557 G W 859034112 + 8 [md11_raid5]
8,240 1 3346 0.097557062 1557 P N [md11_raid5]
8,240 1 3347 0.097557386 1557 I W 859034112 + 8 [md11_raid5]
8,240 1 3348 0.097564103 1557 U N [md11_raid5] 1
8,240 1 3349 0.097564957 1557 D W 859034112 + 8 [md11_raid5]
8,240 1 3350 0.097567754 1557 U N [md11_raid5] 1
8,240 1 3351 0.302256343 1557 U N [md11_raid5] 1
8,240 3 1117 1.999682436 0 C W 859034112 + 8 [0]
<--------- write completes only here after 1.7 secs
8,240 2 1121 1.999737611 5319 Q WBS [kworker/u:1]
8,240 2 1122 1.999738330 5319 G WBS [kworker/u:1]
8,240 2 1123 1.999738907 5319 P N [kworker/u:1]
8,240 2 1124 1.999739262 5319 I W [kworker/u:1]
8,240 2 1125 1.999740107 5319 C W 0 [0]
8,240 3 1118 1.999792262 1557 A W 859034120 + 8 <-
(8,241) 859032072
8,240 3 1119 1.999792413 1557 Q W 859034120 + 8 [md11_raid5]
8,240 3 1120 1.999793213 1557 G W 859034120 + 8 [md11_raid5]
8,240 3 1121 1.999794118 1557 P N [md11_raid5]
8,240 3 1122 1.999794422 1557 I W 859034120 + 8 [md11_raid5]
8,240 3 1123 1.999801199 1557 U N [md11_raid5] 1
8,240 3 1124 1.999801884 1557 D W 859034120 + 8 [md11_raid5]
8,240 3 1125 1.999804152 1557 U N [md11_raid5] 1
8,240 1 3352 1.999972330 0 C W 859034120 + 8 [0]
8,240 0 1121 2.000030768 5319 Q WBS [kworker/u:1]
8,240 0 1122 2.000031333 5319 G WBS [kworker/u:1]
...[CUT]...
8,240 2 2236 2.097562839 5319 Q WBS [kworker/u:1]
8,240 2 2237 2.097563398 5319 G WBS [kworker/u:1]
8,240 2 2238 2.097564060 5319 P N [kworker/u:1]
8,240 2 2239 2.097564381 5319 I W [kworker/u:1]
8,240 2 2240 2.097565242 5319 C W 0 [0]
8,240 1 6656 2.097613839 1557 A W 859037688 + 8 <-
(8,241) 859035640
8,240 1 6657 2.097613995 1557 Q W 859037688 + 8 [md11_raid5]
8,240 1 6658 2.097614822 1557 G W 859037688 + 8 [md11_raid5]
8,240 1 6659 2.097615667 1557 P N [md11_raid5]
8,240 1 6660 2.097615929 1557 I W 859037688 + 8 [md11_raid5]
8,240 1 6661 2.097622654 1557 U N [md11_raid5] 1
8,240 1 6662 2.097623343 1557 D W 859037688 + 8 [md11_raid5]
8,240 1 6663 2.097625494 1557 U N [md11_raid5] 1
8,240 1 6664 2.097729515 0 C W 859037688 + 8 [0]
8,240 1 6665 2.097758720 1557 U N [md11_raid5] 0
8,240 0 2236 2.097772162 5319 Q WBS [kworker/u:1]
8,240 0 2237 2.097772725 5319 G WBS [kworker/u:1]
8,240 0 2238 2.097773386 5319 P N [kworker/u:1]
8,240 0 2239 2.097773705 5319 I W [kworker/u:1]
8,240 0 2240 2.097774617 5319 C W 0 [0]
8,240 1 6666 2.097823090 1557 A W 859037696 + 8 <-
(8,241) 859035648
8,240 1 6667 2.097823331 1557 Q W 859037696 + 8 [md11_raid5]
8,240 1 6668 2.097824152 1557 G W 859037696 + 8 [md11_raid5]
8,240 1 6669 2.097824997 1557 P N [md11_raid5]
8,240 1 6670 2.097825247 1557 I W 859037696 + 8 [md11_raid5]
8,240 1 6671 2.097831340 1557 U N [md11_raid5] 1
8,240 1 6672 2.097832048 1557 D W 859037696 + 8 [md11_raid5]
8,240 1 6673 2.097834622 1557 U N [md11_raid5] 1
8,240 1 6674 2.302260290 1557 U N [md11_raid5] 1
8,240 3 2234 4.000203670 0 C W 859037696 + 8 [0]
<--------- write completes here
8,240 2 2241 4.000256295 5319 Q WBS [kworker/u:1]
8,240 2 2242 4.000256890 5319 G WBS [kworker/u:1]
8,240 2 2243 4.000257477 5319 P N [kworker/u:1]
8,240 2 2244 4.000257832 5319 I W [kworker/u:1]
8,240 2 2245 4.000258659 5319 C W 0 [0]
8,240 3 2235 4.000311545 1557 A W 859037704 + 8 <-
(8,241) 859035656
8,240 3 2236 4.000311677 1557 Q W 859037704 + 8 [md11_raid5]
8,240 3 2237 4.000312973 1557 G W 859037704 + 8 [md11_raid5]
8,240 3 2238 4.000313864 1557 P N [md11_raid5]
8,240 3 2239 4.000314120 1557 I W 859037704 + 8 [md11_raid5]
8,240 3 2240 4.000321038 1557 U N [md11_raid5] 1
8,240 3 2241 4.000321735 1557 D W 859037704 + 8 [md11_raid5]
8,240 3 2242 4.000323994 1557 U N [md11_raid5] 1
8,240 1 6675 4.000427558 0 C W 859037704 + 8 [0]
8,240 1 6676 4.000461787 1557 U N [md11_raid5] 0
8,240 0 2241 4.000476233 5319 Q WBS [kworker/u:1]
8,240 0 2242 4.000476793 5319 G WBS [kworker/u:1]
...[CUT]...
8,240 2 3516 4.111962898 5319 Q WBS [kworker/u:1]
8,240 2 3517 4.111963472 5319 G WBS [kworker/u:1]
8,240 2 3518 4.111964044 5319 P N [kworker/u:1]
8,240 2 3519 4.111964378 5319 I W [kworker/u:1]
8,240 2 3520 4.111965295 5319 C W 0 [0]
8,240 3 3804 4.112013696 1557 A W 859033592 + 8 <-
(8,241) 859031544
8,240 3 3805 4.112013841 1557 Q W 859033592 + 8 [md11_raid5]
8,240 3 3806 4.112014653 1557 G W 859033592 + 8 [md11_raid5]
8,240 3 3807 4.112015874 1557 P N [md11_raid5]
8,240 3 3808 4.112016166 1557 I W 859033592 + 8 [md11_raid5]
8,240 3 3809 4.112024245 1557 U N [md11_raid5] 1
8,240 3 3810 4.112025141 1557 D W 859033592 + 8 [md11_raid5]
8,240 3 3811 4.112027938 1557 U N [md11_raid5] 1
8,240 1 10164 4.112131048 6493 C W 859033592 + 8 [0]
8,240 3 3812 4.112163132 1557 U N [md11_raid5] 0
8,240 0 3516 4.112174324 5319 Q WBS [kworker/u:1]
8,240 0 3517 4.112174914 5319 G WBS [kworker/u:1]
8,240 0 3518 4.112175476 5319 P N [kworker/u:1]
8,240 0 3519 4.112175795 5319 I W [kworker/u:1]
8,240 0 3520 4.112176739 5319 C W 0 [0]
8,240 3 3813 4.112221841 1557 A W 859033600 + 8 <-
(8,241) 859031552
8,240 3 3814 4.112221952 1557 Q W 859033600 + 8 [md11_raid5]
8,240 3 3815 4.112222762 1557 G W 859033600 + 8 [md11_raid5]
8,240 3 3816 4.112223586 1557 P N [md11_raid5]
8,240 3 3817 4.112223901 1557 I W 859033600 + 8 [md11_raid5]
8,240 3 3818 4.112229650 1557 U N [md11_raid5] 1
8,240 3 3819 4.112230275 1557 D W 859033600 + 8 [md11_raid5]
8,240 3 3820 4.112232817 1557 U N [md11_raid5] 1
8,240 3 3821 4.312255629 1557 U N [md11_raid5] 1
8,240 3 3822 6.001904854 0 C W 859033600 + 8 [0]
<--------- write completes here
8,240 3 3823 6.001944705 1557 U N [md11_raid5] 0
8,240 2 3521 6.001961244 5319 Q WBS [kworker/u:1]
8,240 2 3522 6.001961855 5319 G WBS [kworker/u:1]
8,240 2 3523 6.001962360 5319 P N [kworker/u:1]
8,240 2 3524 6.001962718 5319 I W [kworker/u:1]
8,240 2 3525 6.001963488 5319 C W 0 [0]
8,240 3 3824 6.002019367 1557 A W 859033608 + 8 <-
(8,241) 859031560
8,240 3 3825 6.002019532 1557 Q W 859033608 + 8 [md11_raid5]
8,240 3 3826 6.002020332 1557 G W 859033608 + 8 [md11_raid5]
8,240 3 3827 6.002021283 1557 P N [md11_raid5]
8,240 3 3828 6.002021547 1557 I W 859033608 + 8 [md11_raid5]
8,240 3 3829 6.002028883 1557 U N [md11_raid5] 1
8,240 3 3830 6.002029497 1557 D W 859033608 + 8 [md11_raid5]
8,240 3 3831 6.002032017 1557 U N [md11_raid5] 1
8,240 1 10165 6.002174214 0 C W 859033608 + 8 [0]
8,240 0 3521 6.002243288 5319 Q WBS [kworker/u:1]
8,240 0 3522 6.002248444 5319 G WBS [kworker/u:1]
--------------------------------------------------
You see how regular that is? It hangs at exactly .30--.31 every time and
resumes 1.7 seconds later at .00, +/- about 500us!
Maybe it drifts very slowly, something like 500us per cycle, but it is
almost perfectly synchronized...
Have you seen anything similar?
As I said, I think I saw this bug with XFS 2 years ago on
2.6.31-rc-something, with an MD RAID 6 on the same 3ware. So if it is a
kernel bug, it has been there for a long time.
What do you think?
I am cc-ing the open-iscsi and SCST people to see if they can suggest a
way to debug the iSCSI case. In particular, it would be very interesting
to get the number of commands received and replied to on the target side,
and match it against the number of commands submitted from the initiator
side (but I suspect such counters do not exist). If the target side says
it has already replied, but the initiator side only sees the reply after
1 second... that would be very meaningful. The bad news is that I do not
have much time to debug this :-(
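What I can do cheaply on the initiator side, in case it helps, is to
sample the block layer's in-flight counters during a stall. A rough
sketch (sdc is the iSCSI device from the iostat above; I do not know
whether SCST exposes an equivalent counter on the target side):
--------------------------------------------------
# sample the in-flight read/write request counts for sdc ten times a second;
# during the "off" phase these stay nonzero even though iostat reports no
# completed I/O at all
while true; do
    printf '%s inflight(r w): %s\n' "$(date +%H:%M:%S.%N)" "$(cat /sys/block/sdc/inflight)"
    sleep 0.1
done
--------------------------------------------------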
Thank you