On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> >> On Sun, 22 Apr 2007, Pallai Roland wrote:
> >>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>>> How did you run your read test?
> >>>
> >>> I ran 100 parallel reader processes (dd) on top of an XFS file
> >>> system; try this:
> >>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >>>
> >>> and don't forget to set max_sectors_kb below the chunk size (e.g. 64/128KB):
> >>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >>>
> >>> I also set 2048/4096 readahead sectors with blockdev --setra.
> >>>
> >>> You need 50-100 reader processes to hit this issue, I think. My
> >>> kernel version is 2.6.20.3.
> >>
> >> In one xterm:
> >> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>
> >> In another:
> >> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
> >
> > Write and read files on top of XFS, not on the block device. $i isn't
> > a typo: write into 100 files, then read them back with 100 parallel
> > processes when done. I have 1GB of RAM; you may want to boot with the
> > mem= kernel parameter to match.
> >
> > 1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
> > 2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> I use a combination of 4 Silicon Image controllers (SiI) and the Intel
> 965 chipset. My max_sectors_kb is 128KB, my chunk size is 128KB, why do
> you set the max_sectors_kb less than the chunk size?
64KB is the maximum on Marvell SATA chips under Linux, maybe a hardware
limitation. I would simply have used a 128KB chunk, but then I hit this
issue.

> For read-ahead, there are some good benchmarks by SGI(?) I believe, and
> some others, that state 16MB is the best value; over that, you lose on
> either reads or writes. 16MB appears to be optimal for best overall
> value. Do these values look good to you, or?
Where can I find this benchmark? I did some testing on this topic, too.
I think the optimal readahead size always depends on the number of
sequentially reading processes and the available RAM: with 100 processes
and 1GB of RAM, the maximum useful readahead is about 5-6MB; anything
bigger turns into readahead thrashing and undesirable context switches.
Anyway, I tried 16MB now, but the readahead size doesn't matter for this
bug; the same context switch storm appears with any readahead window
size.

> Read 100 files on XFS simultaneously:
Is max_sectors_kb 128KB here? I assume so. I see some anomaly, but maybe
your readahead window is just too big for that many processes; it's not
the bug I'm talking about in my original post. Your high interrupt and
CS counts build up slowly, which may be a sign of readahead thrashing.
In my case the CS storm began in the first second, with no high
interrupt count:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so     bi    bo   in    cs us sy id wa
 0  0      0   7220      0 940972    0    0      0     0  256    20  0  0 100  0
 0 13      0 383636      0 535520    0    0 144904    32 2804 63834  1 42  0 57
24 20      0 353312      0 558200    0    0 121524     0 2669 67604  1 40  0 59
15 21      0 314808      0 557068    0    0  91300    33 2572 53442  0 29  0 71

I've attached a small kernel patch; with it you can measure the
readahead thrashing ratio (see the tail of /proc/vmstat). I think it's a
handy tool for finding the optimal RA size.
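For reference, the back-of-envelope behind the 5-6MB figure (my own
rough estimate, assuming each reader keeps about one full readahead
window resident in the page cache at a time):

  # 100 readers x 16MB window = ~1600MB of readahead pages, > 1GB RAM -> thrashing
  # 100 readers x  5MB window =  ~500MB, fits comfortably in 1GB RAM
  # blockdev --setra takes 512-byte sectors, so ~5MB is 10240 sectors:
  blockdev --setra 10240 /dev/md3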
And if you're interested in the bug I'm talking about, set
max_sectors_kb to 64KB.

--
 d
--- linux-2.6.18.2/include/linux/vmstat.h.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/include/linux/vmstat.h	2006-11-06 02:09:25.000000000 +0100
@@ -30,6 +30,7 @@
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		RATHRASHED,
 		NR_VM_EVENT_ITEMS
 };
--- linux-2.6.18.2/mm/vmstat.c.orig	2006-11-06 01:55:58.000000000 +0100
+++ linux-2.6.18.2/mm/vmstat.c	2006-11-06 02:05:14.000000000 +0100
@@ -502,6 +502,8 @@
 	"allocstall",

 	"pgrotated",
+
+	"rathrashed",
 #endif
 };
--- linux-2.6.18.2/mm/readahead.c.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/mm/readahead.c	2006-11-06 02:13:12.000000000 +0100
@@ -568,6 +568,7 @@
 	ra->flags |= RA_FLAG_MISS;
 	ra->flags &= ~RA_FLAG_INCACHE;
 	ra->cache_hit = 0;
+	count_vm_event(RATHRASHED);
 }

 /*
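With the patch applied, something like this gives a quick thrash count
while the readers run (just a sketch; "rathrashed" is the counter the
patch adds, and the 10-second window is arbitrary):

  a=`awk '/^rathrashed/ {print $2}' /proc/vmstat`
  sleep 10
  b=`awk '/^rathrashed/ {print $2}' /proc/vmstat`
  echo "readahead thrash events in 10s: `expr $b - $a`"

A near-zero delta at full read throughput suggests the RA size is still
fine; a steadily climbing one means windows are being evicted before
they are used.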