On Sunday 22 April 2007 12:23:12 Justin Piszcz wrote:
> On Sun, 22 Apr 2007, Pallai Roland wrote:
> > On Sunday 22 April 2007 10:47:59 Justin Piszcz wrote:
> >> On Sun, 22 Apr 2007, Pallai Roland wrote:
> >>> On Sunday 22 April 2007 02:18:09 Justin Piszcz wrote:
> >>>> How did you run your read test?
> >>>
> >>> I ran 100 parallel reader processes (dd) on top of an XFS file
> >>> system; try this:
> >>>  for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>>  for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
> >>>
> >>> and don't forget to set max_sectors_kb below the chunk size (e.g. 64/128KB):
> >>>  /sys/block# for i in sd*; do echo 64 >$i/queue/max_sectors_kb; done
> >>>
> >>> I also set 2048/4096 readahead sectors with blockdev --setra.
> >>>
> >>> You need 50-100 reader processes to hit this issue, I think. My
> >>> kernel version is 2.6.20.3.
> >>
> >> In one xterm:
> >> for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=64k 2>/dev/null; done
> >>
> >> In another:
> >> for i in `seq 1 100`; do dd if=/dev/md3 of=$i.out bs=64k & done
> >
> > Write and read files on top of XFS, not on the block device. $i isn't
> > a typo: write into 100 files, then read them back with 100 parallel
> > processes when done. I have 1GB of RAM; you may want to boot with the
> > mem= kernel parameter to match.
> >
> > 1. for i in `seq 1 100`; do dd of=$i if=/dev/zero bs=1M count=100 2>/dev/null; done
> > 2. for i in `seq 1 100`; do dd if=$i of=/dev/zero bs=64k 2>/dev/null & done
>
> I use a combination of 4 Silicon Image controllers (SiI) and the Intel
> 965 chipset. My max_sectors_kb is 128KB, my chunk size is 128KB, why do
> you set the max_sectors_kb less than the chunk size?
64KB is the maximum on Marvell SATA chips under Linux, maybe a hardware
limitation. I would simply have used a 128KB chunk, but then I hit this
issue.

> For read-ahead, there are some good benchmarks by SGI(?) I believe, and
> some others, that state 16MB is the best value; over that, you lose on
> either reads or writes. 16MB appears to be optimal for best overall
> value. Do these values look good to you, or?
Where can I find this benchmark? I did some testing on this topic, too.
I think the optimal readahead size always depends on the number of
sequentially reading processes and the available RAM: with 100 processes
and 1GB of RAM, the maximum useful readahead is about 5-6MB; anything
bigger turns into readahead thrashing and undesirable context switches.
Anyway, I tried 16MB now, but the readahead size doesn't matter for this
bug; the same context switch storm appears with any readahead window
size.

> Read 100 files on XFS simultaneously:
Is max_sectors_kb 128KB here? I assume so. I see some anomaly, but maybe
your readahead window is just too big for that many processes; it's not
the bug I'm talking about in my original post. Your high interrupt and
CS counts build up slowly, which may be a sign of readahead thrashing.
In my case the CS storm began in the first second, with no high
interrupt count:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so     bi    bo   in    cs us sy id wa
 0  0      0   7220      0 940972    0    0      0     0  256    20  0  0 100  0
 0 13      0 383636      0 535520    0    0 144904    32 2804 63834  1 42  0 57
24 20      0 353312      0 558200    0    0 121524     0 2669 67604  1 40  0 59
15 21      0 314808      0 557068    0    0  91300    33 2572 53442  0 29  0 71

I've attached a small kernel patch; with it you can measure the
readahead thrashing ratio (see the tail of /proc/vmstat). I think it's a
handy tool for finding the optimal RA size.
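For reference, the back-of-envelope behind the 5-6MB figure (my own
rough estimate, assuming each reader keeps about one full readahead
window resident in the page cache at a time):

  # 100 readers x 16MB window = ~1600MB of readahead pages, > 1GB RAM -> thrashing
  # 100 readers x  5MB window =  ~500MB, fits comfortably in 1GB RAM
  # blockdev --setra takes 512-byte sectors, so ~5MB is 10240 sectors:
  blockdev --setra 10240 /dev/md3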
And if you're interested in the bug I'm talking about, set
max_sectors_kb to 64KB.

--
 d
--- linux-2.6.18.2/include/linux/vmstat.h.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/include/linux/vmstat.h	2006-11-06 02:09:25.000000000 +0100
@@ -30,6 +30,7 @@
 		FOR_ALL_ZONES(PGSCAN_DIRECT),
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		RATHRASHED,
 		NR_VM_EVENT_ITEMS
 };
--- linux-2.6.18.2/mm/vmstat.c.orig	2006-11-06 01:55:58.000000000 +0100
+++ linux-2.6.18.2/mm/vmstat.c	2006-11-06 02:05:14.000000000 +0100
@@ -502,6 +502,8 @@
 	"allocstall",

 	"pgrotated",
+
+	"rathrashed",
 #endif
 };
--- linux-2.6.18.2/mm/readahead.c.orig	2006-09-20 05:42:06.000000000 +0200
+++ linux-2.6.18.2/mm/readahead.c	2006-11-06 02:13:12.000000000 +0100
@@ -568,6 +568,7 @@
 	ra->flags |= RA_FLAG_MISS;
 	ra->flags &= ~RA_FLAG_INCACHE;
 	ra->cache_hit = 0;
+	count_vm_event(RATHRASHED);
 }

 /*
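With the patch applied, something like this gives a quick thrash count
while the readers run (just a sketch; "rathrashed" is the counter the
patch adds, and the 10-second window is arbitrary):

  a=`awk '/^rathrashed/ {print $2}' /proc/vmstat`
  sleep 10
  b=`awk '/^rathrashed/ {print $2}' /proc/vmstat`
  echo "readahead thrash events in 10s: `expr $b - $a`"

A near-zero delta at full read throughput suggests the RA size is still
fine; a steadily climbing one means windows are being evicted before
they are used.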