On Mon, 2009-06-22 at 21:10 +0200, Heinz Mauelshagen wrote: > On Sun, 2009-06-21 at 22:06 +1000, Neil Brown wrote: > > On Friday June 19, heinzm@xxxxxxxxxx wrote: > > > On Fri, 2009-06-19 at 11:43 +1000, Neil Brown wrote: > > > > On Wednesday June 17, neilb@xxxxxxx wrote: > > > > > > > > > > I will try to find time to review your dm-raid5 code with a view to > > > > > understanding how it plugs in to dm, and then how the md/raid5 engine > > > > > can be used by dm-raid5. > > > > > > Hi Neil. > > > > > > > > > > > I've had a bit of a look through the dm-raid5 patches. > > > > > > Thanks. > > > > > > > > > > > Some observations: > > > > > > > > - You have your own 'xor' code against which you do a run-time test of > > > > the 'xor_block' code which md/raid5 uses - then choose the fasted. > > > > This really should not be necessary. If you have xor code that runs > > > > faster than anything in xor_block, it really would be best to submit > > > > it for inclusion in the common xor code base. > > > > > > This is in because it actually shows better performance regularly by > > > utilizing cache lines etc. more efficiently (tested on Intel, AMD and > > > Sparc). > > > > > > If xor_block would always have been performed best, I'd dropped that > > > optimization already. <SNIP> Dan, Neil, like mentioned before I left to LinuxTag last week, here comes an initial take on dm-raid45 warm/cold CPU cache xor speed optimization metrics. This shall give us the base to decide to keep or drop the dm-raid45 internal xor optimization magic or move (part of) it into the crypto subsystem. Heinz Howto: ------ I added a loop to walk the list of recovery stripes to dm-raid45.c in xor_optimize() to allow for over committing the cache and some variables to be able to display absolute minimum and maximum xor runs performed plus the number of xor runs achieved per cycle for xor_blocks() and for the dm-raid45 build in xor optimization. In order to make results more deterministic, I run xor_speed() for <= 5 ticks. See diff vs. dm-devel dm-raid45 patch (submitted Jun 15th) attached. Tests being performed on the following 2 systems: hostname: a4 2.6.31-rc1@250HZ timer frequency Core i7 920@xxxxxx, 8 MB 3rd Level Cache 6GB RAM hostname: t4 2.6.31-rc1@250HZ timer frequency 2 CPU Opeteron 280@xxxxxx, 2*1 MB 2nd Level Cache 2GB RAM with the xor optimization being the only load on the systems. I've performed test runs on each of those with the following mapping tables and 128 iterations for each of them, which represents a small array case with 3 drives per set, running the xor optimization on a single core: Intel: 0 58720256 raid45 core 2 8192 nosync raid5_la 7 -1 -1 -1 512 10 nosync 1 3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0 ... 0 58720256 raid45 core 2 8192 nosync raid5_la 7 -1 -1 -1 512 10 nosync 13 3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0 Opteron: 0 58720256 raid45 core 2 8192 nosync raid5_la 7 -1 -1 -1 256 10 nosync 1 3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0 ... 0 58720256 raid45 core 2 8192 nosync raid5_la 7 -1 -1 -1 256 10 nosync 13 3 -1 /dev/mapper/error1 0 /dev/mapper/error2 0 /dev/mapper/error3 0 Because no actual IO is being performed, I just mapped to error targets (table used: "0 2199023255552 error"; I know it's large but it ain't matter). The number following the 2nd nosync parameter is the amount of recovery stripes with io size of 512 sectors = 256 kilobytes per chunk or 256 sectors = 128 kilobytes per chunk respectively. I.e. a work set of 768/384 kilobytes per recovery stripe. These values shall make sure, that results differ in the per mille range (i.e. more than 100 cycles per test run) where appropriate. Systems are running out of cache at ~ >= 8 stripes on the Intel (8192 - 2048 code / (512 / 2) / 3) and ~ >= 0 stripes on the Opteron system (1024 - 768 code) / (256 / 2) / 3). assuming some cache utilization for code and other data. See raw kernel log extracts being created by these test runs attached in a tarball and the script to extract the metrics as well. Intel results with 128 iterations each: --------------------------------------- 1 stripe : NB:10 111/80 HM:118 111/82 2 stripes : NB:25 113/87 HM:103 112/91 3 stripes : NB:24 115/93 HM:104 114/93 4 stripes : NB:48 114/93 HM:80 114/93 5 stripes : NB:38 113/94 HM:90 114/94 6 stripes : NB:25 116/94 HM:103 114/94 7 stripes : NB:25 115/95 HM:103 115/95 8 stripes : NB:62 117/96 HM:66 116/95 <<<--- cold cache starts here 9 stripes : NB:66 117/96 HM:62 116/95 10 stripes: NB:73 117/96 HM:55 114/95 11 stripes: NB:63 114/96 HM:65 112/95 12 stripes: NB:51 111/96 HM:77 110/95 13 stripes: NB:65 109/96 HM:63 112/95 NB: number of xor_blocks() parity calculations winning per 128 iterations HM: number of dm-raid45 xor() parity calculations equal to/winning xor_blocks per 128 iterations NN/MM: count of maximm/minimum calculations achived per iteration in <= 5 ticks. Opteron results with 128 iterations each: ----------------------------------------- 1 stripe : NB:0 30/20 HM:128 64/53 2 stripes : NB:0 31/21 HM:128 68/55 3 stripes : NB:0 31/22 HM:128 68/57 4 stripes : NB:0 32/22 HM:128 70/61 5 stripes : NB:0 32/22 HM:128 70/63 6 stripes : NB:0 35/22 HM:128 70/64 7 stripes : NB:0 32/23 HM:128 69/63 8 stripes : NB:0 44/23 HM:128 76/65 9 stripes : NB:0 43/23 HM:128 73/65 10 stripes: NB:0 35/23 HM:128 72/64 11 stripes: NB:0 35/24 HM:128 72/64 12 stripes: NB:0 33/24 HM:128 72/65 13 stripes: NB:0 33/23 HM:128 71/64 Test analysis: -------------- I must have done something wrong ;-) On the Opteron, dm-raid45 xor() outperforms xor_blocks() by far. No warm cache significance visible. On the Intel, dm-raid45 xor() performs slightly better on warm cache vs. xor_blocks() performing slightly better on cold cache, which may be the result of the lag of prefetching in dm-raid45 xor(). xor_blocks() achieves a slightly better maximum in 8 / 13 runs vs. xor() in 2 test runs. in 3 runs they achieve the same maximum. This is not deterministic: min/max varying by up to > 200% on the Opteron and up to 46% on the Intel. Questions/Recommendations: -------------------------- Review the code changes and the data analysis please. Review the test cases and argue if those are valid or recommend different ones please. Can we get this more deterministic (e.g. use prefetching for dm-raid45 xor()) ? Regards, Heinz
Attachment:
xor_performance_metrics
Description: application/shellscript
{drivers => /usr/src/linux/drivers}/md/dm-raid45.c | 139 ++++++++++++++------ 1 files changed, 98 insertions(+), 41 deletions(-) diff --git a/drivers/md/dm-raid45.c b/drivers/md/dm-raid45.c index 0c33fea..6ace975 100644 --- a/drivers/md/dm-raid45.c +++ b/linux/drivers/md/dm-raid45.c @@ -8,7 +8,28 @@ * * Linux 2.6 Device Mapper RAID4 and RAID5 target. * - * Supports: + * Tested-by: Intel; Marcin.Labun@xxxxxxxxx, krzysztof.wojcik@xxxxxxxxx + * + * + * Supports the following ATARAID vendor solutions (and SNIA DDF): + * + * Adaptec HostRAID ASR + * SNIA DDF1 + * Hiphpoint 37x + * Hiphpoint 45x + * Intel IMSM + * Jmicron ATARAID + * LSI Logic MegaRAID + * NVidia RAID + * Promise FastTrack + * Silicon Image Medley + * VIA Software RAID + * + * via the dmraid application. + * + * + * Features: + * * o RAID4 with dedicated and selectable parity device * o RAID5 with rotating parity (left+right, symmetric+asymmetric) * o recovery of out of sync device for initial @@ -37,7 +58,7 @@ * ANALYZEME: recovery bandwidth */ -static const char *version = "v0.2594p"; +static const char *version = "v0.2596l"; #include "dm.h" #include "dm-memcache.h" @@ -101,9 +122,6 @@ static const char *version = "v0.2594p"; /* Check value in range. */ #define range_ok(i, min, max) (i >= min && i <= max) -/* Check argument is power of 2. */ -#define POWER_OF_2(a) (!(a & (a - 1))) - /* Structure access macros. */ /* Derive raid_set from stripe_cache pointer. */ #define RS(x) container_of(x, struct raid_set, sc) @@ -1848,10 +1866,10 @@ struct xor_func { xor_function_t f; const char *name; } static xor_funcs[] = { - { xor_8, "xor_8" }, - { xor_16, "xor_16" }, - { xor_32, "xor_32" }, { xor_64, "xor_64" }, + { xor_32, "xor_32" }, + { xor_16, "xor_16" }, + { xor_8, "xor_8" }, { xor_blocks_wrapper, "xor_blocks" }, }; @@ -3114,10 +3132,10 @@ static void _do_endios(struct raid_set *rs, struct stripe *stripe, SetStripeReconstructed(stripe); /* FIXME: reschedule to be written in case of read. */ - // if (!RSDead && RSDegraded(rs) !StripeRBW(stripe)) { - // chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY); - // stripe_chunks_rw(stripe); - // } + /* if (!RSDead && RSDegraded(rs) !StripeRBW(stripe)) { + chunk_set(CHUNK(stripe, stripe->idx.recover), DIRTY); + stripe_chunks_rw(stripe); + } */ stripe->idx.recover = -1; } @@ -3257,7 +3275,7 @@ static void do_ios(struct raid_set *rs, struct bio_list *ios) /* Check for recovering regions. */ sector = _sector(rs, bio); r = region_state(rs, sector, DM_RH_RECOVERING); - if (unlikely(r && bio_data_dir(bio) == WRITE)) { + if (unlikely(r)) { delay++; /* Wait writing to recovering regions. */ dm_rh_delay_by_region(rh, bio, @@ -3409,64 +3427,104 @@ static unsigned mbpers(struct raid_set *rs, unsigned speed) /* * Discover fastest xor algorithm and # of chunks combination. */ -/* Calculate speed for algorithm and # of chunks. */ +/* Calculate speed of particular algorithm and # of chunks. */ static unsigned xor_speed(struct stripe *stripe) { + int ticks = 5; unsigned r = 0; - unsigned long j; - /* Wait for next tick. */ - for (j = jiffies; j == jiffies; ) - ; + /* Do xors for a few ticks. */ + while (ticks--) { + unsigned xors = 0; + unsigned long j = jiffies; + + while (j == jiffies) { + mb(); + common_xor(stripe, stripe->io.size, 0, 0); + mb(); + xors++; + mb(); + } - /* Do xors for a full tick. */ - for (j = jiffies; j == jiffies; ) { - mb(); - common_xor(stripe, stripe->io.size, 0, 0); - mb(); - r++; + if (xors > r) + r = xors; } return r; } +/* Define for xor multi recovery stripe optimization runs. */ +#define DMRAID45_XOR_TEST + /* Optimize xor algorithm for this RAID set. */ static unsigned xor_optimize(struct raid_set *rs) { - unsigned chunks_max = 2, p = rs->set.raid_devs, speed_max = 0; + unsigned chunks_max = 2, p, speed_max = 0; struct xor_func *f = ARRAY_END(xor_funcs), *f_max = NULL; struct stripe *stripe; +unsigned speed_hm = 0, speed_min = ~0, speed_xor_blocks = 0; BUG_ON(list_empty(&rs->recover.stripes)); +#ifndef DMRAID45_XOR_TEST stripe = list_first_entry(&rs->recover.stripes, struct stripe, lists[LIST_RECOVER]); /* Must set uptodate so that xor() will belabour chunks. */ - while (p--) + for (p = rs->set.raid_devs; p-- ;) SetChunkUptodate(CHUNK(stripe, p)); +#endif /* Try all xor functions. */ while (f-- > xor_funcs) { unsigned speed; - /* Set actual xor function for common_xor(). */ - rs->xor.f = f; - rs->xor.chunks = (f->f == xor_blocks_wrapper ? - (MAX_XOR_BLOCKS + 1) : XOR_CHUNKS_MAX) + 1; - - while (rs->xor.chunks-- > 2) { - speed = xor_speed(stripe); - if (speed > speed_max) { - speed_max = speed; - chunks_max = rs->xor.chunks; - f_max = f; +#ifdef DMRAID45_XOR_TEST + list_for_each_entry(stripe, &rs->recover.stripes, + lists[LIST_RECOVER]) { + for (p = rs->set.raid_devs; p-- ;) + SetChunkUptodate(CHUNK(stripe, p)); +#endif + + /* Set actual xor function for common_xor(). */ + rs->xor.f = f; + rs->xor.chunks = (f->f == xor_blocks_wrapper ? + (MAX_XOR_BLOCKS + 1) : + XOR_CHUNKS_MAX) + 1; + + while (rs->xor.chunks-- > 2) { + speed = xor_speed(stripe); + +#ifdef DMRAID45_XOR_TEST + if (f->f == xor_blocks_wrapper) { + if (speed > speed_xor_blocks) + speed_xor_blocks = speed; + } else if (speed > speed_hm) + speed_hm = speed; + + if (speed < speed_min) + speed_min = speed; +#endif + + if (speed > speed_max) { + speed_max = speed; + chunks_max = rs->xor.chunks; + f_max = f; + } } +#ifdef DMRAID45_XOR_TEST } +#endif } - /* Memorize optimum parameters. */ + /* Memorize optimal parameters. */ rs->xor.f = f_max; rs->xor.chunks = chunks_max; +#ifdef DMRAID45_XOR_TEST + DMINFO("%s stripes=%u min=%u xor_blocks=%u hm=%u max=%u", + speed_max == speed_hm ? "HM" : "NB", + rs->recover.recovery_stripes, speed_min, + speed_xor_blocks, speed_hm, speed_max); +#endif return speed_max; } @@ -3786,7 +3844,7 @@ static int get_raid_variable_parms(struct dm_target *ti, char **argv, "Invalid recovery switch; must be \"sync\" or \"nosync\"" }, { 0, "Invalid number of recovery stripes;" - "must be -1, > 0 and <= 16384", + "must be -1, > 0 and <= 64", RECOVERY_STRIPES_MIN, RECOVERY_STRIPES_MAX, &vp->recovery_stripes_parm, &vp->recovery_stripes, NULL }, }, *varp; @@ -3831,7 +3889,7 @@ static int get_raid_variable_parms(struct dm_target *ti, char **argv, if (sscanf(*(argv++), "%d", &value) != 1 || (value != -1 && - ((varp->action && !POWER_OF_2(value)) || + ((varp->action && !is_power_of_2(value)) || !range_ok(value, varp->min, varp->max)))) TI_ERR(varp->errmsg);
Attachment:
xor_optomize_test_data.tar.bz2
Description: application/bzip-compressed-tar
-- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel