The first take of the stripe-queue implementation[1] had a performance-limiting bug in __wait_for_inactive_queue. Fixing that issue drastically changed the performance characteristics. The following data from tiobench shows the relative performance difference of the stripe-queue patchset.

Unit information
================
File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%      = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU% - relative throughput per CPU load

Configuration
=============
Platform: 1200MHz iop348 with 4-disk sata_vsc array

mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
                 File   Blk     Num  Avg    Maximum  CPU
Identifier       Size   Size    Thr  Rate   (CPU%)   Eff
---------------  -----  ------  ---  -----  -------  -----
2.6.22-rc7-iop1  2048     4096    1     0%       4%    -3%
2.6.22-rc7-iop1  2048     4096    2   -38%     -33%    -8%
2.6.22-rc7-iop1  2048     4096    4   -35%     -30%    -8%
2.6.22-rc7-iop1  2048     4096    8   -14%     -11%    -3%
2.6.22-rc7-iop1  2048   131072    1     2%       1%     2%
2.6.22-rc7-iop1  2048   131072    2   -11%     -10%    -2%
2.6.22-rc7-iop1  2048   131072    4    -7%      -6%    -1%
2.6.22-rc7-iop1  2048   131072    8    -9%      -6%    -4%

Random Reads
                 File   Blk     Num  Avg    Maximum  CPU
Identifier       Size   Size    Thr  Rate   (CPU%)   Eff
---------------  -----  ------  ---  -----  -------  -----
2.6.22-rc7-iop1  2048     4096    1    -9%      15%   -21%
2.6.22-rc7-iop1  2048     4096    2    -1%     -30%    42%
2.6.22-rc7-iop1  2048     4096    4   -14%     -22%    10%
2.6.22-rc7-iop1  2048     4096    8   -21%     -28%     9%
2.6.22-rc7-iop1  2048   131072    1    -8%      -4%    -4%
2.6.22-rc7-iop1  2048   131072    2   -13%     -13%     0%
2.6.22-rc7-iop1  2048   131072    4   -15%     -15%     0%
2.6.22-rc7-iop1  2048   131072    8   -13%     -13%     0%

Sequential Writes
                 File   Blk     Num  Avg    Maximum  CPU
Identifier       Size   Size    Thr  Rate   (CPU%)   Eff
---------------  -----  ------  ---  -----  -------  -----
2.6.22-rc7-iop1  2048     4096    1    25%      11%    12%
2.6.22-rc7-iop1  2048     4096    2    41%      42%    -1%
2.6.22-rc7-iop1  2048     4096    4    40%      18%    19%
2.6.22-rc7-iop1  2048     4096    8    15%      -5%    21%
2.6.22-rc7-iop1  2048   131072    1    65%      57%     4%
2.6.22-rc7-iop1  2048   131072    2    46%      36%     8%
2.6.22-rc7-iop1  2048   131072    4    24%      -7%    34%
2.6.22-rc7-iop1  2048   131072    8    28%     -15%    51%

Random Writes
                 File   Blk     Num  Avg    Maximum  CPU
Identifier       Size   Size    Thr  Rate   (CPU%)   Eff
---------------  -----  ------  ---  -----  -------  -----
2.6.22-rc7-iop1  2048     4096    1     2%      -8%    11%
2.6.22-rc7-iop1  2048     4096    2    -1%     -19%    21%
2.6.22-rc7-iop1  2048     4096    4     2%       2%     0%
2.6.22-rc7-iop1  2048     4096    8    -1%     -28%    37%
2.6.22-rc7-iop1  2048   131072    1     2%      -3%     5%
2.6.22-rc7-iop1  2048   131072    2     3%      -4%     7%
2.6.22-rc7-iop1  2048   131072    4     4%      -3%     8%
2.6.22-rc7-iop1  2048   131072    8     5%      -9%    15%

The write performance numbers are better than I expected and would seem to address the concerns raised in the thread "Odd (slow) RAID performance"[2].

The read performance drop was not expected. However, the numbers suggest some additional changes to the queuing model: where read performance drops there is a roughly equal drop in CPU utilization, which suggests that pure read requests should be handled immediately, without a trip to the stripe-queue workqueue.
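To make that idea concrete, here is a minimal sketch of such a read fast-path. Nothing below is code from the patchset: struct stripe_queue's fields and the helpers (sq_is_pure_read, sq_dispatch, handle_inline, defer_to_workqueue) are hypothetical stand-ins. The point is only the dispatch policy: a queue whose io_weight shows reads but no writes or overwrites needs no parity work, so it could be serviced in the submitting context and only parity-generating work deferred to the workqueue.

/*
 * Illustrative toy only -- not code from the stripe-queue patchset.
 * The struct fields and helpers are hypothetical stand-ins that model
 * the dispatch decision described above.
 */
#include <stdbool.h>
#include <stdio.h>

struct stripe_queue {
	unsigned int read_weight;       /* blocks waiting to be read */
	unsigned int write_weight;      /* blocks waiting to be written */
	unsigned int overwrite_weight;  /* blocks that fully overwrite */
};

/* A queue with read weight but no write/overwrite weight needs no
 * parity calculation, so nothing forces it through the workqueue. */
static bool sq_is_pure_read(const struct stripe_queue *sq)
{
	return sq->read_weight &&
	       !sq->write_weight && !sq->overwrite_weight;
}

static void handle_inline(const struct stripe_queue *sq)
{
	printf("pure read (%u blocks): handle in submitting context\n",
	       sq->read_weight);
}

static void defer_to_workqueue(const struct stripe_queue *sq)
{
	printf("write/overwrite pending (%u/%u blocks): defer to workqueue\n",
	       sq->write_weight, sq->overwrite_weight);
}

/* Service pure reads immediately; defer everything else. */
static void sq_dispatch(const struct stripe_queue *sq)
{
	if (sq_is_pure_read(sq))
		handle_inline(sq);
	else
		defer_to_workqueue(sq);
}

int main(void)
{
	struct stripe_queue pure_read = { .read_weight = 4 };
	struct stripe_queue rmw = { .read_weight = 1, .write_weight = 3 };

	sq_dispatch(&pure_read);
	sq_dispatch(&rmw);
	return 0;
}

Whether this is actually a win depends on how much of the read path can safely run outside the workqueue; the sketch only captures the dispatch decision, not the locking.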
Although it is not shown in the data above, another positive result is that increasing the cache size past a certain point causes the write performance gains to erode, i.e. negative returns rather than merely diminishing returns. The stripe-queue can only carry out its optimizations while the cache is busy. When the cache is large, requests can be handled without waiting, and performance approaches the original 1:1 (queue-to-stripe-head) model. CPU speed therefore dictates the maximum effective cache size: once the CPU can no longer keep the stripe-queue saturated, performance falls off from the peak. This is a positive change because it shows that the new queuing model can produce higher performance with fewer resources, but it does require more care when changing 'stripe_cache_size'. The above numbers were taken with the default cache size of 256.

Changes since take1:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename queue_weight -> io_weight
* fix r5_io_weight_size
* implement support for sysfs changes to stripe_cache_size
* delete and re-add stripe queues from their management lists rather than
  moving them.  This guarantees that when sq->count is non-zero the queue
  is not on a list (identical to stripe_head handling)
* __wait_for_inactive_queue was incorrectly using conf->inactive_blocked,
  which is exclusively for the stripe cache.  Added
  conf->inactive_queue_blocked and set the routine to wait until the
  number of active queues drops below 7/8 of the total before unblocking
  processing.  The 7/8 figure arises as follows: get_active_stripe waits
  for the number of active stripes to drop below 3/4 of the stripe cache,
  i.e. 1/4 inactive, and
  conf->max_nr_stripes / 4 == conf->max_nr_stripes * STRIPE_QUEUE_SIZE / 8
  iff STRIPE_QUEUE_SIZE == 2
* change raid5_congested to report whether the *queue* is congested, not
  the cache

As before, these patches are on top of the md-accel series:

	git://lost.foo-projects.org/~dwillia2/git/iop md-accel+experimental

For those wanting to test these changes in isolation from other major kernel changes, look for the upcoming 2.6.22-iop1 kernel release on SourceForge[3].

--
Dan

[1] http://marc.info/?l=linux-raid&m=118297138908827&w=2
[2] http://marc.info/?l=linux-raid&m=116489600902178&w=2
[3] https://sourceforge.net/projects/xscaleiop/