Re: [BUG] Raid1/5 over iSCSI trouble

Dan Williams wrote:
On Fri, 2007-10-19 at 14:04 -0700, BERTRAND Joël wrote:
	Sorry for this last mail. I have found another problem, but I don't know whether this bug comes from iscsi-target or from raid5 itself. The iSCSI target is disconnected because the istd1 and md_d0_raid5 kernel threads each use 100% of a CPU!

Tasks: 235 total,   6 running, 227 sleeping,   0 stopped,   2 zombie
Cpu(s):  0.1%us, 12.5%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4139032k total,   218424k used,  3920608k free,    10136k buffers
Swap:  7815536k total,        0k used,  7815536k free,    64808k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5824 root      15  -5     0    0    0 R  100  0.0  10:34.25 istd1
 5599 root      15  -5     0    0    0 R  100  0.0   7:25.43 md_d0_raid5

	When iSCSI works fine:

Tasks: 231 total,   2 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.2%us, 2.5%sy, 0.0%ni, 95.7%id, 0.1%wa, 0.0%hi, 1.5%si, 0.0%st
Mem:   4139032k total,  4126064k used,    12968k free,    94680k buffers
Swap:  7815536k total,        0k used,  7815536k free,  3758776k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9774 root      15  -5     0    0    0 R   40  0.0   2:00.34 istd1
 9738 root      15  -5     0    0    0 S    9  0.0   2:06.56 md_d0_raid5
 4129 root      20   0 41648 5024 2432 S    6  0.1   2:46.39 fail2ban-server
 9830 root      20   0  3248 1544 1120 R    1  0.0   0:00.18 top
 4063 root      20   0  7424 5288  832 S    1  0.1   0:00.84 unfsd
 9776 root      15  -5     0    0    0 D    1  0.0   0:00.82 istiod1
 9780 root      15  -5     0    0    0 D    1  0.0   0:00.96 istiod1
 9782 root      15  -5     0    0    0 D    1  0.0   0:01.10 istiod1
    1 root      20   0  2576  960  816 S    0  0.0   0:01.56 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0

After a random time (the iSCSI target is not disconnected, but it no longer answers initiator requests):

Tasks: 232 total,   5 running, 226 sleeping,   0 stopped,   1 zombie
Cpu(s): 0.1%us, 7.9%sy, 0.0%ni, 91.6%id, 0.0%wa, 0.1%hi, 0.2%si, 0.0%st
Mem:   4139032k total,  4125912k used,    13120k free,    95640k buffers
Swap:  7815536k total,        0k used,  7815536k free,  3758792k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9738 root      15  -5     0    0    0 R  100  0.0   3:56.57 md_d0_raid5
 9739 root      15  -5     0    0    0 D   14  0.0   0:20.34 md_d0_resync
 9845 root      20   0  3248 1544 1120 R    1  0.0   0:07.00 top
 4129 root      20   0 41648 5024 2432 S    0  0.1   2:55.94 fail2ban-server
    1 root      20   0  2576  960  816 S    0  0.0   0:01.58 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0
    4 root      15  -5     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/1
    6 root      15  -5     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1

You can see a very strange thing: when I booted this server, md_d0 was clean. When the bug occurs, md_d0_resync is started (/dev/md/d0p1 is part of my RAID1 array). Why? This partition is not mounted on the local server, only exported over iSCSI.
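
A quick way to see what state the array is in when the unexpected resync starts is to capture it right away; a minimal sketch (the member device names on the last line are placeholders, not taken from this machine):

cat /proc/mdstat                        # which arrays are resyncing, and the progress/speed
mdadm --detail /dev/md_d0               # array state, failed or spare members, event count
mdadm --examine /dev/sdc1 /dev/sdd1     # compare member superblocks (placeholder device names)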

	After disconnection of the iSCSI target:

Tasks: 232 total,   7 running, 224 sleeping,   0 stopped,   1 zombie
Cpu(s): 0.0%us, 15.2%sy, 0.0%ni, 84.3%id, 0.0%wa, 0.1%hi, 0.3%si, 0.0%st
Mem:   4139032k total,  4127584k used,    11448k free,    95752k buffers
Swap:  7815536k total,        0k used,  7815536k free,  3758792k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9738 root      15  -5     0    0    0 R  100  0.0   4:56.82 md_d0_raid5
 9774 root      15  -5     0    0    0 R  100  0.0   5:52.41 istd1
 9739 root      15  -5     0    0    0 R   14  0.0   0:28.90 md_d0_resync
 9916 root      20   0  3248 1544 1120 R    2  0.0   0:00.56 top
 4129 root      20   0 41648 5024 2432 S    0  0.1   2:56.17 fail2ban-server
    1 root      20   0  2576  960  816 S    0  0.0   0:01.58 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0
    4 root      15  -5     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/1
    6 root      15  -5     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1

What is the output of:
cat /proc/5824/wchan
cat /proc/5599/wchan

Root poulenc:[/usr/scripts] > cat /proc/9738/wchan
_start
Root poulenc:[/usr/scripts] > cat /proc/9774/wchan
_start
Root poulenc:[/usr/scripts] > vmstat -a
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free    inact  active   si   so    bi    bo   in   cs us sy id wa
 5  0      0  10824 3777528  112280    0    0     7    19   12   19  0  0 100  0
Root poulenc:[/usr/scripts] > vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 5  0      0  10928  95856 3756880    0    0     7    19   12   19  0  0 100  0
Root poulenc:[/usr/scripts] >  vmstat -s
      4139032 K total memory
      4127864 K used memory
       112216 K active memory
      3777568 K inactive memory
        11168 K free memory
        95928 K buffer memory
      3756896 K swap cache
      7815536 K total swap
            0 K used swap
      7815536 K free swap
        26901 non-nice user cpu ticks
          824 nice user cpu ticks
       204746 system cpu ticks
     94245668 idle cpu ticks
        14378 IO-wait cpu ticks
         3086 IRQ cpu ticks
        33971 softirq cpu ticks
            0 stolen cpu ticks
      6555730 pages paged in
     18136571 pages paged out
            0 pages swapped in
            0 pages swapped out
     11259263 interrupts
     18167358 CPU context switches
   1192827483 boot time
         9962 forks
Root poulenc:[/usr/scripts] > vmstat -d
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged  sectors      ms  total merged  sectors       ms cur sec
sda   716720 143247 94849012 2617628   6732  24789   269070   222236   0 532
sdb   103590  23780  6140736   85244 409226 308936 88160014 13352564   0 929
md0    17469      0   456250       0   4557      0    36456        0   0   0
sdc   265108 2103743 37883308 2810656 266586 272237  8767696   628236   0 825
sdd   266248 2099943 37844236 2801400 264081 275321  8781088   609140   0 824
sde   263660 2104487 37875132 2835548 262296 276561  8776000   595140   0 826
sdf   283262 2084095 37862108 2432988 262197 277305  8785600   581008   0 779
sdg   285205 2082611 37870324 2291464 260836 278822  8791456   567908   0 752
sdh   291773 2072874 37817788 1892320 260572 278182  8775472   550688   0 685
loop0      0      0        0       0      0      0        0        0   0   0
loop1      0      0        0       0      0      0        0        0   0   0
loop2      0      0        0       0      0      0        0        0   0   0
loop3      0      0        0       0      0      0        0        0   0   0
loop4      0      0        0       0      0      0        0        0   0   0
loop5      0      0        0       0      0      0        0        0   0   0
loop6      0      0        0       0      0      0        0        0   0   0
loop7      0      0        0       0      0      0        0        0   0   0
md6       31      0      496       0      0      0        0        0   0   0
md1     4326      0   161366       0     27      0      110        0   0   0
md2   206279      0  4713706       0  14670      0   118752        0   0   0
md3     6709      0   392442       0   9964      0    80040        0   0   0
md4      247      0     3746       0    131      0     1208        0   0   0
md5    63245      0  7365546       0    292      0     2424        0   0   0
md_d0     14      0      216       0 642029      0 36004104        0   0   0
Root poulenc:[/usr/scripts] >
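
For what it's worth, a wchan of "_start" does not point at a real kernel sleep routine, which is consistent with what top shows: both threads are running (state R) at 100% CPU rather than blocking on I/O. A rough way to capture their kernel stacks for the md/iscsi-target developers, assuming SysRq is enabled in this kernel (CONFIG_MAGIC_SYSRQ):

echo 1 > /proc/sys/kernel/sysrq                   # enable all SysRq functions
echo t > /proc/sysrq-trigger                      # dump every task's state and stack to the kernel log
dmesg | grep -B 2 -A 20 -E 'istd1|md_d0_raid5'    # pull out the two spinning threads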

Please note that the zombie processes are not significant on this server. It runs watchdog, and the zombie process count is always between 0 and 2.

	When the iSCSI target hangs, the load average is 14.03, 13.63, 10.47, with only md_d0_raid5, istd1 and md_d0_resync as running processes:

 9774 root      15  -5     0    0    0 R  100  0.0  18:17.63 istd1
 9738 root      15  -5     0    0    0 R  100  0.0  17:22.04 md_d0_raid5
 9739 root      15  -5     0    0    0 R   14  0.0   2:15.18 md_d0_resync
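
For reference, the load average counts tasks in both the runnable (R) and uninterruptible-sleep (D) states, so a load of 14 with only three runnable threads suggests that other tasks (probably the istiod1 I/O threads seen earlier in D state) are stuck in uninterruptible sleep. A rough way to list them, as a sketch using standard procps ps options:

ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'   # keep the header plus any D-state tasks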

	I won't reboot this server, in case you need some other information.

	Regards,

	JKB
-
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
