Christian Pernegger wrote:
> OK. Back to the fs again, same command, different device. Still
> glacially slow (and still running), only now the whole box is at a
> standstill, too. cat /proc/cpuinfo takes about 3 minutes (!) to
> complete, I'm still waiting for top to launch (15min and counting).
> I'll leave mke2fs running for now ...
What's the state of your array at this point - is it resyncing?
Yes. Didn't think it would matter (much). Never did before.
It does. If everything works ok it should not matter, but that's not
your case ;)
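(For reference, /proc/mdstat shows whether a resync is running and how
far along it is, and the resync speed can be throttled while testing,
e.g.:

  cat /proc/mdstat
  # limit resync to ~1 MB/s per device while testing (value is in KB/s)
  echo 1000 > /proc/sys/dev/raid/speed_limit_max
)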
o how about making filesystem(s) on individual disks first, to see
how that will work out? Maybe on each of them in parallel? :)
Running. System is perfectly responsive during 4x mke2fs -j -q on raw devices.
Done. Upper bound for the duration is 8 minutes (probably much lower;
I forgot to make it beep on completion), which is much better than the
2 hours with the syncing RAID.
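For reference, the parallel run was roughly along these lines (the
device names sdb..sde are placeholders here, not the exact ones):

  for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
      mke2fs -j -q "$d" &    # one mke2fs per raw disk
  done
  wait                       # all four run concurrently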
Aha. Excellent.
26: 1041479 267 IO-APIC-fasteoi sata_promise
27: 0 0 IO-APIC-fasteoi sata_promise
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 4 0 12864 1769688 10000 0 0 0 146822 539 809 0 26 23 51
Ok. ~146 MB/sec.
Cpu(s): 1.3%us, 8.1%sy, 0.0%ni, 41.6%id, 46.0%wa, 0.7%hi, 2.3%si, 0.0%st
46.0% waiting
I hope you can interpret that :)
Some ;)
o try --assume-clean when creating the array
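  Something along these lines - md0 and the member names are
  placeholders, keep whatever chunk size etc. you used before:

    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --assume-clean /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

  --assume-clean skips the initial resync, so parity isn't guaranteed
  to be consistent until a repair is run - fine for testing, not for
  real data.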
mke2fs (same command as in the first post) is now running on a fresh
--assume-clean array w/o crypto. The system is only marginally less
responsive than under idle load, if at all.
So the responsiveness problem is solved here, right? I mean, if
there's no resync going on (the case with --assume-clean), the rest
of the system works as expected, right?
But inode tables are being written at only about 8-10 per second. In
the single-disk case I couldn't read the numbers fast enough.
Note that mkfs now has to do 3x more work, too, since the device is
3x larger (for a 4-drive raid5).
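(E.g. with four hypothetical 500 GB members, the raid5 device is
(4-1) x 500 GB = 1.5 TB - three disks' worth of inode tables to
initialize instead of one.)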
chris@jesus:~$ cat /proc/interrupts
26: 1211165 267 IO-APIC-fasteoi sata_promise
27: 0 0 IO-APIC-fasteoi sata_promise
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 11092 1813376 10804 0 0 0 13316 535 5201 0 9 51 40
That's 10 times slower than in the case of 4 individual disks.
Cpu(s): 0.0%us, 10.1%sy, 0.0%ni, 55.6%id, 33.7%wa, 0.2%hi, 0.3%si, 0.0%st
and only 33.7% waiting, which is probably due to the lack of
parallelism.
From vmstat I gather that total write throughput is an order of
magnitude slower than on the 4 raw disks in parallel. Naturally the
mke2fs on the raid isn't parallelized, but it should still be
sequential enough to get close to the max for a single disk
(~40-60 MB/s), right?
Well, not really. Mkfs is doing many small writes all over the
place, so each one is a seek+write. And it's synchronous - the next
write isn't submitted until the current one completes.
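The effect is easy to see with dd if you want to compare - the target
path here is just an example:

  # buffered 4k writes - the kernel can queue and merge them
  dd if=/dev/zero of=/tmp/ddtest bs=4k count=10000
  # synchronous 4k writes - each one waits for the previous to hit disk
  dd if=/dev/zero of=/tmp/ddtest bs=4k count=10000 oflag=sync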
Ok. For now I don't see a problem (other than that there IS a problem
somewhere - obviously). Interrupts are ok. System time (10.1%) in the
second case doesn't look right, but it was 8.1% before...
Only 2 guesses left. And I really mean "guesses", because I can't
say definitively what's going on anyway.
First, try to disable bitmaps on the raid array, and see if it makes
any difference. For some reason I think it will... ;)
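An internal bitmap can be removed (and re-added) on a live array,
e.g. - md0 is a placeholder:

  mdadm --grow /dev/md0 --bitmap=none       # drop the write-intent bitmap
  mdadm --grow /dev/md0 --bitmap=internal   # put it back afterwards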
And second, the whole thing looks pretty much like a more general
problem discussed here and elsewhere over the last few days: the
handling of parallel reads and writes, where a single write may stall
reads for quite some time and vice versa. I see it every day on disks
without NCQ/TCQ - the system becomes mostly single-tasking, sorta like
ol' good MS-DOG :) Good TCQ-enabled drives survive very high load
while the system stays more or less responsive (and I forget when I
last saw a "bad" TCQ-enabled drive - even a 10-year-old 4 GB Seagate
has excellent TCQ support ;). And all modern SATA stuff works pretty
much like old IDE drives, which were designed "for personal use", or
"single-task only" - even ones that CLAIM to support NCQ in reality
do not... But that's a long story, and your disks and/or controllers
(or the combination) don't even support NCQ anyway...
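(FWIW, whether NCQ is actually active can be checked per drive - sda
is a placeholder, and a queue_depth of 1 means no NCQ:

  cat /sys/block/sda/device/queue_depth
  dmesg | grep -i ncq
)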
/mjt