Re: [BUG] Raid1/5 over iSCSI trouble

Dan Williams wrote:
On Fri, 2007-10-19 at 14:04 -0700, BERTRAND Joël wrote:
	Sorry for this last mail. I have found another problem, but I don't know whether this bug comes from iscsi-target or from raid5 itself. The iSCSI target is disconnected because the istd1 and md_d0_raid5 kernel threads each use 100% of a CPU!

Tasks: 235 total,   6 running, 227 sleeping,   0 stopped,   2 zombie
Cpu(s):  0.1%us, 12.5%sy,  0.0%ni, 87.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4139032k total,   218424k used,  3920608k free,    10136k buffers
Swap:  7815536k total,        0k used,  7815536k free,    64808k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 5824 root      15  -5     0    0    0 R  100  0.0  10:34.25 istd1
 5599 root      15  -5     0    0    0 R  100  0.0   7:25.43 md_d0_raid5

	When iSCSI works fine:

Tasks: 231 total,   2 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s): 0.2%us, 2.5%sy, 0.0%ni, 95.7%id, 0.1%wa, 0.0%hi, 1.5%si, 0.0%st
Mem:   4139032k total,  4126064k used,    12968k free,    94680k buffers
Swap:  7815536k total,        0k used,  7815536k free,  3758776k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9774 root      15  -5     0    0    0 R   40  0.0   2:00.34 istd1
 9738 root      15  -5     0    0    0 S    9  0.0   2:06.56 md_d0_raid5
 4129 root      20   0 41648 5024 2432 S    6  0.1   2:46.39 fail2ban-server
 9830 root      20   0  3248 1544 1120 R    1  0.0   0:00.18 top
 4063 root      20   0  7424 5288  832 S    1  0.1   0:00.84 unfsd
 9776 root      15  -5     0    0    0 D    1  0.0   0:00.82 istiod1
 9780 root      15  -5     0    0    0 D    1  0.0   0:00.96 istiod1
 9782 root      15  -5     0    0    0 D    1  0.0   0:01.10 istiod1
    1 root      20   0  2576  960  816 S    0  0.0   0:01.56 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0

After a random time (the iSCSI target is not disconnected, but it no longer answers initiator requests):

Tasks: 232 total,   5 running, 226 sleeping,   0 stopped,   1 zombie
Cpu(s): 0.1%us, 7.9%sy, 0.0%ni, 91.6%id, 0.0%wa, 0.1%hi, 0.2%si, 0.0%st
Mem:   4139032k total,  4125912k used,    13120k free,    95640k buffers
Swap:  7815536k total,        0k used,  7815536k free,  3758792k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9738 root      15  -5     0    0    0 R  100  0.0   3:56.57 md_d0_raid5
 9739 root      15  -5     0    0    0 D   14  0.0   0:20.34 md_d0_resync
 9845 root      20   0  3248 1544 1120 R    1  0.0   0:07.00 top
 4129 root      20   0 41648 5024 2432 S    0  0.1   2:55.94 fail2ban-server
    1 root      20   0  2576  960  816 S    0  0.0   0:01.58 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0
    4 root      15  -5     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/1
    6 root      15  -5     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1

You can see a very strange thing: when I booted this server, md_d0 was clean. When the bug occurs, md_d0_resync is started (/dev/md/d0p1 is part of my RAID1 array). Why? This partition is not mounted on the local server, only exported over iSCSI.
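
A quick way to see what state the array is in when the unexpected resync starts is to capture it right away; a minimal sketch (the member device names on the last line are placeholders, not taken from this machine):

cat /proc/mdstat                        # which arrays are resyncing, and the progress/speed
mdadm --detail /dev/md_d0               # array state, failed or spare members, event count
mdadm --examine /dev/sdc1 /dev/sdd1     # compare member superblocks (placeholder device names)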

	After disconnection of the iSCSI target:

Tasks: 232 total,   7 running, 224 sleeping,   0 stopped,   1 zombie
Cpu(s): 0.0%us, 15.2%sy, 0.0%ni, 84.3%id, 0.0%wa, 0.1%hi, 0.3%si, 0.0%st
Mem:   4139032k total,  4127584k used,    11448k free,    95752k buffers
Swap:  7815536k total,        0k used,  7815536k free,  3758792k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9738 root      15  -5     0    0    0 R  100  0.0   4:56.82 md_d0_raid5
 9774 root      15  -5     0    0    0 R  100  0.0   5:52.41 istd1
 9739 root      15  -5     0    0    0 R   14  0.0   0:28.90 md_d0_resync
 9916 root      20   0  3248 1544 1120 R    2  0.0   0:00.56 top
 4129 root      20   0 41648 5024 2432 S    0  0.1   2:56.17 fail2ban-server
    1 root      20   0  2576  960  816 S    0  0.0   0:01.58 init
    2 root      15  -5     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/0
    4 root      15  -5     0    0    0 S    0  0.0   0:00.02 ksoftirqd/0
    5 root      RT  -5     0    0    0 S    0  0.0   0:00.00 migration/1
    6 root      15  -5     0    0    0 S    0  0.0   0:00.00 ksoftirqd/1

What is the output of:
cat /proc/5824/wchan
cat /proc/5599/wchan

Root poulenc:[/usr/scripts] > cat /proc/9738/wchan
_start
Root poulenc:[/usr/scripts] > cat /proc/9774/wchan
_start
Root poulenc:[/usr/scripts] > vmstat -a
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free    inact  active   si   so    bi    bo   in   cs us sy id wa
 5  0      0  10824 3777528  112280    0    0     7    19   12   19  0  0 100  0
Root poulenc:[/usr/scripts] > vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 5  0      0  10928  95856 3756880    0    0     7    19   12   19  0  0 100  0
Root poulenc:[/usr/scripts] >  vmstat -s
      4139032 K total memory
      4127864 K used memory
       112216 K active memory
      3777568 K inactive memory
        11168 K free memory
        95928 K buffer memory
      3756896 K swap cache
      7815536 K total swap
            0 K used swap
      7815536 K free swap
        26901 non-nice user cpu ticks
          824 nice user cpu ticks
       204746 system cpu ticks
     94245668 idle cpu ticks
        14378 IO-wait cpu ticks
         3086 IRQ cpu ticks
        33971 softirq cpu ticks
            0 stolen cpu ticks
      6555730 pages paged in
     18136571 pages paged out
            0 pages swapped in
            0 pages swapped out
     11259263 interrupts
     18167358 CPU context switches
   1192827483 boot time
         9962 forks
Root poulenc:[/usr/scripts] > vmstat -d
disk- ------------reads------------ ------------writes----------- -----IO------
       total merged  sectors      ms  total merged  sectors       ms cur sec
sda   716720 143247 94849012 2617628   6732  24789   269070   222236   0 532
sdb   103590  23780  6140736   85244 409226 308936 88160014 13352564   0 929
md0    17469      0   456250       0   4557      0    36456        0   0   0
sdc   265108 2103743 37883308 2810656 266586 272237  8767696   628236   0 825
sdd   266248 2099943 37844236 2801400 264081 275321  8781088   609140   0 824
sde   263660 2104487 37875132 2835548 262296 276561  8776000   595140   0 826
sdf   283262 2084095 37862108 2432988 262197 277305  8785600   581008   0 779
sdg   285205 2082611 37870324 2291464 260836 278822  8791456   567908   0 752
sdh   291773 2072874 37817788 1892320 260572 278182  8775472   550688   0 685
loop0      0      0        0       0      0      0        0        0   0   0
loop1      0      0        0       0      0      0        0        0   0   0
loop2      0      0        0       0      0      0        0        0   0   0
loop3      0      0        0       0      0      0        0        0   0   0
loop4      0      0        0       0      0      0        0        0   0   0
loop5      0      0        0       0      0      0        0        0   0   0
loop6      0      0        0       0      0      0        0        0   0   0
loop7      0      0        0       0      0      0        0        0   0   0
md6       31      0      496       0      0      0        0        0   0   0
md1     4326      0   161366       0     27      0      110        0   0   0
md2   206279      0  4713706       0  14670      0   118752        0   0   0
md3     6709      0   392442       0   9964      0    80040        0   0   0
md4      247      0     3746       0    131      0     1208        0   0   0
md5    63245      0  7365546       0    292      0     2424        0   0   0
md_d0     14      0      216       0 642029      0 36004104        0   0   0
Root poulenc:[/usr/scripts] >
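
For what it's worth, a wchan of "_start" does not point at a real kernel sleep routine, which is consistent with what top shows: both threads are running (state R) at 100% CPU rather than blocking on I/O. A rough way to capture their kernel stacks for the md/iscsi-target developers, assuming SysRq is enabled in this kernel (CONFIG_MAGIC_SYSRQ):

echo 1 > /proc/sys/kernel/sysrq                   # enable all SysRq functions
echo t > /proc/sysrq-trigger                      # dump every task's state and stack to the kernel log
dmesg | grep -B 2 -A 20 -E 'istd1|md_d0_raid5'    # pull out the two spinning threads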

Please note that the zombie processes are not significant on this server. It runs watchdog, and the zombie process count is always between 0 and 2.

	When the iSCSI target hangs, the load average is 14.03, 13.63, 10.47, with only md_d0_raid5, istd1 and md_d0_resync as running processes:

 9774 root      15  -5     0    0    0 R  100  0.0  18:17.63 istd1
 9738 root      15  -5     0    0    0 R  100  0.0  17:22.04 md_d0_raid5
 9739 root      15  -5     0    0    0 R   14  0.0   2:15.18 md_d0_resync
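
For reference, the load average counts tasks in both the runnable (R) and uninterruptible-sleep (D) states, so a load of 14 with only three runnable threads suggests that other tasks (probably the istiod1 I/O threads seen earlier in D state) are stuck in uninterruptible sleep. A rough way to list them, as a sketch using standard procps ps options:

ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'   # keep the header plus any D-state tasks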

	I won't reboot this server, in case you need some other information.

	Regards,

	JKB
-
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
