On 4/2/07, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
On 3/30/07, Raz Ben-Jehuda(caro) <raziebe@xxxxxxxxx> wrote:
> Please see below.
>
> On 8/28/06, Neil Brown <neilb@xxxxxxx> wrote:
> > On Sunday August 13, raziebe@xxxxxxxxx wrote:
> > > well ... me again
> > >
> > > Following your advice....
> > >
> > > I added a deadline for every WRITE stripe head when it is created.
> > > In raid5_activate_delayed I check whether the deadline has expired,
> > > and if it has not, I set the sh to preread-active mode.
> > >
> > > This small fix (and a few other places in the code) reduced the
> > > amount of reads to zero with dd, but with no improvement to
> > > throughput. But with random access to the raid (buffers aligned to
> > > the stripe width and with the size of the stripe width) there is
> > > an improvement of at least 20%.
> > >
> > > Problem is that a user must know what he is doing, else there
> > > would be a reduction in performance if the deadline is too long
> > > (say 100 ms).
> >
> > So if I understand you correctly, you are delaying write requests to
> > partial stripes slightly (your 'deadline') and this is sometimes
> > giving you a 20% improvement?
> >
> > I'm not surprised that you could get some improvement. 20% is quite
> > surprising. It would be worth following through with this to make
> > that improvement generally available.
> >
> > As you say, picking a time in milliseconds is very error prone. We
> > really need to come up with something more natural.
> > I had hoped that the 'unplug' infrastructure would provide the right
> > thing, but apparently not. Maybe unplug is just being called too
> > often.
> >
> > I'll see if I can duplicate this myself and find out what is really
> > going on.
> >
> > Thanks for the report.
> >
> > NeilBrown
>
> Neil, hello. I am sorry for this interval; I was abruptly assigned to
> a different project.
>
> 1.
> I took another look at the raid5 delay patch I wrote a while ago. I
> ported it to 2.6.17 and tested it. It appears to work, and when used
> correctly it eliminates the read penalty.
>
> 2. Benchmarks.
> Configuration:
> I am testing a raid5 of 3 disks with a 1 MB chunk size. IOs are
> synchronous and non-buffered (O_DIRECT), 2 MB in size and always
> aligned to the beginning of a stripe. The kernel is 2.6.17. The
> stripe_delay was set to 10 ms.
>
> Attached is the simple_write code.
>
> Command:
>   simple_write /dev/md1 2048 0 1000
> simple_write raw-writes (O_DIRECT) sequentially, starting from offset
> zero, 2048 kilobytes at a time, 1000 times.
>
> Benchmark before patch:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda            1848.00      8384.00     50992.00       8384      50992
> sdb            1995.00     12424.00     51008.00      12424      51008
> sdc            1698.00      8160.00     51000.00       8160      51000
> sdd               0.00         0.00         0.00          0          0
> md0               0.00         0.00         0.00          0          0
> md1             450.00         0.00    102400.00          0     102400
>
> Benchmark after patch:
>
> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
> sda             389.11         0.00    128530.69          0     129816
> sdb             381.19         0.00    129354.46          0     130648
> sdc             383.17         0.00    128530.69          0     129816
> sdd               0.00         0.00         0.00          0          0
> md0               0.00         0.00         0.00          0          0
> md1            1140.59         0.00    259548.51          0     262144
>
> As one can see, no additional reads were done. One can actually
> calculate the raid's utilization: (n-1)/n * (single-disk throughput
> with 1 MB writes).
>
> 3. The patch code.
> The kernel tested above was 2.6.17. The patch is against 2.6.20.2,
> because I noticed big code differences between 17 and 20.x. This
> patch was not tested on 2.6.20.2, but it is essentially the same. I
> have not tested (yet) degraded mode or any other non-common paths.
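[Editorial aside, checking the quoted utilization formula against the
post-patch numbers above (my arithmetic, assuming iostat's default
512-byte blocks): each active disk sustains roughly 128,800 blk/s,
about 63 MB/s, so the three disks together write about 189 MB/s of raw
bandwidth; (n-1)/n = 2/3 of that is about 126 MB/s, which agrees with
md1's reported ~259,549 blk/s, about 127 MB/s of user data.]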
This is along the same lines of what I am working on, new cache
policies for raid5/6, so I want to give it a try as well.

Unfortunately gmail has mangled your patch. Can you resend as an
attachment?

patch: **** malformed patch at line 10:
(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))

Thanks,
Dan
Dan hello.
Attached are the patches. Also, I have added another test unit:
random_writev. It is not much code, but it does the job: it tests
writing a vector, and it shows the same results as writing with a
single buffer.

What are the new cache policies?

Please note! I have not indented the patch, nor followed the
Documentation/SubmittingPatches instructions. If Neil approves this
patch or parts of it, I will do so.

# Benchmark 3: testing an 8-disk raid5.

Tyan NUMA dual-CPU (AMD) machine with 8 SATA Maxtor disks; the
controller is a Promise in JBOD mode.

raid conf:
md1 : active raid5 sda2[0] sdh1[7] sdg1[6] sdf1[5] sde1[4] sdd1[3] sdc1[2] sdb2[1]
      3404964864 blocks level 5, 1024k chunk, algorithm 2 [8/8] [UUUUUUUU]

In order to achieve zero reads I had to tune the deadline to 20 ms (so
long?). stripe_cache_size is 256, which is exactly what is needed to
get a full-stripe hit with this configuration.
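[Editorial aside on why 256 works out, a back-of-the-envelope check
assuming 4 KB pages: each stripe_head caches one page per member disk,
so buffering a stripe to the full 1 MB chunk depth takes
1 MB / 4 KB = 256 stripe_heads; with even one fewer (the 255 tried
below), a full-chunk-deep write can no longer be held entirely in the
stripe cache.]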
command:
  random_writev /dev/md1 7168 0 3000 10000
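[Per the program's usage line (<device> <size in kb> <offset in kb>
<diskSizeGB> <loops>), this issues 7168 KB writes at pseudo-random,
write-size-aligned offsets over a 3000 GB span, 10000 times; 7168 KB
is presumably one full data stripe here: 7 data disks x 1 MB chunk.]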
iostat snapshot:

avg-cpu:  %user   %nice    %sys %iowait   %idle
           0.00    0.00   21.00   29.00   50.00

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda               0.00         0.00         0.00          0          0
md0               0.00         0.00         0.00          0          0
sda             234.34         0.00     50400.00          0      49896
sdb             235.35         0.00     50658.59          0      50152
sdc             242.42         0.00     51014.14          0      50504
sdd             246.46         0.00     50755.56          0      50248
sde             248.48         0.00     51272.73          0      50760
sdf             245.45         0.00     50755.56          0      50248
sdg             244.44         0.00     50755.56          0      50248
sdh             245.45         0.00     50755.56          0      50248
md1            1407.07         0.00    347741.41          0     344264

Try setting stripe_cache_size to 255 and you will notice the delay.
Try lowering stripe_deadline and you will notice how the amount of
reads grows.

Cheers
--
Raz
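[Editorial aside, the same sanity check for the snapshot above,
assuming 512-byte blocks: ~50,800 blk/s is about 25 MB/s per disk,
about 198 MB/s raw across eight disks; 7/8 of that is about 174 MB/s,
within a few percent of md1's reported ~347,741 blk/s, about
170 MB/s.]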
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/drivers/md/raid5.c linux-2.6.20.2-raid/drivers/md/raid5.c
--- linux-2.6.20.2/drivers/md/raid5.c	2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/drivers/md/raid5.c	2007-03-30 12:37:55.000000000 +0300
@@ -65,6 +65,7 @@
 #define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
 #define HASH_MASK		(NR_HASH - 1)
+
 #define stripe_hash(conf, sect)	(&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK]))
 
 /* bio's attached to a stripe+device for I/O are linked together in bi_sector
@@ -234,6 +235,8 @@
 	sh->sector = sector;
 	sh->pd_idx = pd_idx;
 	sh->state = 0;
+	sh->active_preread_jiffies =
+		msecs_to_jiffies( atomic_read(&conf->deadline_ms) )+ jiffies;
 
 	sh->disks = disks;
@@ -628,6 +631,7 @@
 	clear_bit(R5_LOCKED, &sh->dev[i].flags);
 	set_bit(STRIPE_HANDLE, &sh->state);
+	sh->active_preread_jiffies = jiffies;
 	release_stripe(sh);
 	return 0;
 }
@@ -1255,8 +1259,11 @@
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
 			firstwrite = 1;
-	} else
+	} else{
 		bip = &sh->dev[dd_idx].toread;
+		sh->active_preread_jiffies = jiffies;
+	}
+
 	while (*bip && (*bip)->bi_sector < bi->bi_sector) {
 		if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector)
 			goto overlap;
@@ -2437,13 +2444,27 @@
-static void raid5_activate_delayed(raid5_conf_t *conf)
+static struct stripe_head* raid5_activate_delayed(raid5_conf_t *conf)
 {
 	if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) {
 		while (!list_empty(&conf->delayed_list)) {
 			struct list_head *l = conf->delayed_list.next;
 			struct stripe_head *sh;
 			sh = list_entry(l, struct stripe_head, lru);
+
+			if( time_before(jiffies,sh->active_preread_jiffies) ){
+				PRINTK("deadline : no expire sec=%lld %8u %8u\n",
+					(unsigned long long) sh->sector,
+					jiffies_to_msecs(sh->active_preread_jiffies),
+					jiffies_to_msecs(jiffies));
+				return sh;
+			}
+			else{
+				PRINTK("deadline: expire:sec=%lld %8u %8u\n",
+					(unsigned long long)sh->sector,
+					jiffies_to_msecs(sh->active_preread_jiffies),
+					jiffies_to_msecs(jiffies));
+			}
 			list_del_init(l);
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -2451,6 +2472,7 @@
 			list_add_tail(&sh->lru, &conf->handle_list);
 		}
 	}
+	return NULL;
 }
 
 static void activate_bit_delay(raid5_conf_t *conf)
@@ -3191,7 +3213,7 @@
  */
 static void raid5d (mddev_t *mddev)
 {
-	struct stripe_head *sh;
+	struct stripe_head *sh,*delayed_sh=NULL;
 	raid5_conf_t *conf = mddev_to_conf(mddev);
 	int handled;
 
@@ -3218,8 +3240,10 @@
 		    atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD &&
 		    !blk_queue_plugged(mddev->queue) &&
 		    !list_empty(&conf->delayed_list))
-			raid5_activate_delayed(conf);
-
+			delayed_sh=raid5_activate_delayed(conf);
+
+		if(delayed_sh) break;
+
 		while ((bio = remove_bio_from_retry(conf))) {
 			int ok;
 			spin_unlock_irq(&conf->device_lock);
@@ -3254,9 +3278,51 @@
 	unplug_slaves(mddev);
 
 	PRINTK("--- raid5d inactive\n");
+	if (delayed_sh){
+		long wakeup=delayed_sh->active_preread_jiffies-jiffies;
+		PRINTK("--- raid5d inactive sleep for %d\n",
+			jiffies_to_msecs(wakeup) );
+		if (wakeup>0)
+			mddev->thread->timeout = wakeup;
+	}
+}
+
+static ssize_t
+raid5_show_stripe_deadline(mddev_t *mddev, char *page)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	if (conf)
+		return sprintf(page, "%d\n", atomic_read(&conf->deadline_ms));
+	else
+		return 0;
 }
 
 static ssize_t
+raid5_store_stripe_deadline(mddev_t *mddev, const char *page, size_t len)
+{
+	raid5_conf_t *conf = mddev_to_conf(mddev);
+	char *end;
+	int new;
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+	new = simple_strtoul(page, &end, 10);
+	if (!*page || (*end && *end != '\n') )
+		return -EINVAL;
+	if (new < 0 || new > 10000)
+		return -EINVAL;
+	atomic_set(&conf->deadline_ms,new);
+	return len;
+}
+
+static struct md_sysfs_entry
+raid5_stripe_deadline = __ATTR(stripe_deadline, S_IRUGO | S_IWUSR,
+				raid5_show_stripe_deadline,
+				raid5_store_stripe_deadline);
+
+
+static ssize_t
 raid5_show_stripe_cache_size(mddev_t *mddev, char *page)
{
 	raid5_conf_t *conf = mddev_to_conf(mddev);
@@ -3297,6 +3363,9 @@
 	return len;
 }
 
+
+
+
 static struct md_sysfs_entry
 raid5_stripecache_size = __ATTR(stripe_cache_size, S_IRUGO | S_IWUSR,
 				raid5_show_stripe_cache_size,
@@ -3318,8 +3387,10 @@
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
+	&raid5_stripe_deadline.attr,
 	NULL,
 };
+
 static struct attribute_group raid5_attrs_group = {
 	.name = NULL,
 	.attrs = raid5_attrs,
@@ -3567,6 +3638,8 @@
 	blk_queue_merge_bvec(mddev->queue, raid5_mergeable_bvec);
 
+	atomic_set(&conf->deadline_ms,0);
+
 	return 0;
 abort:
 	if (conf) {
diff -ruN -X linux-2.6.20.2/Documentation/dontdiff linux-2.6.20.2/include/linux/raid/raid5.h linux-2.6.20.2-raid/include/linux/raid/raid5.h
--- linux-2.6.20.2/include/linux/raid/raid5.h	2007-03-09 20:58:04.000000000 +0200
+++ linux-2.6.20.2-raid/include/linux/raid/raid5.h	2007-03-30 00:25:38.000000000 +0200
@@ -136,6 +136,7 @@
 	spinlock_t		lock;
 	int			bm_seq;	/* sequence number for bitmap flushes */
 	int			disks;	/* disks in stripe */
+	unsigned long		active_preread_jiffies;
 	struct r5dev {
 		struct bio	req;
 		struct bio_vec	vec;
@@ -254,6 +255,7 @@
 	 * Free stripes pool
 	 */
 	atomic_t		active_stripes;
+	atomic_t		deadline_ms;
 	struct list_head	inactive_list;
 	wait_queue_head_t	wait_for_stripe;
 	wait_queue_head_t	wait_for_overlap;
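[Editorial aside: once the patched module is running, the new knob
should appear next to stripe_cache_size in md's per-array sysfs
directory. The exact path below follows the usual md sysfs layout and
is an assumption, not taken from the mail:

  echo 20 > /sys/block/md1/md/stripe_deadline
  cat /sys/block/md1/md/stripe_deadline

A value of 0, the default set at array start, makes every deadline
expire immediately, i.e. stock behaviour.]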
The random_writev test program:

/* random_writev: issues <size in kb> writes (O_DIRECT, writev of 1 MB
 * page-aligned buffers) at pseudo-random, write-size-aligned offsets. */
#define _LARGEFILE64_SOURCE

#include <iostream>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

using namespace std;

#define MAX_VECS 10

int main(int argc, char *argv[])
{
	if (argc < 6) {
		cout << "usage: <device name> <size to write in kb> <offset in kb> <diskSizeGB> <loops>" << endl;
		return 0;
	}
	char *dev_name = argv[1];
	int fd = open(dev_name, O_LARGEFILE | O_DIRECT | O_WRONLY);
	if (fd < 0) {
		perror("open");
		return -1;
	}
	long long write_sz_bytes  = ((long long)atoi(argv[2])) << 10;
	long long offset_sz_bytes = ((long long)atoi(argv[3])) << 10;
	long long diskSizeBytes   = ((long long)atoi(argv[4])) << 30;
	int loops = atoi(argv[5]);

	/* one page-aligned 1 MB buffer per megabyte of the write */
	struct iovec vec[MAX_VECS];
	int blocks = write_sz_bytes >> 20;
	if (blocks < 1 || blocks > MAX_VECS) {
		printf("write size must be 1..%d MB\n", MAX_VECS);
		return -1;
	}
	for (int i = 0; i < blocks; i++) {
		char *buffer = (char *)valloc(1 << 20);
		if (!buffer) {
			perror("valloc");
			return -1;
		}
		memset(buffer, 0x00, 1 << 20);
		vec[i].iov_base = buffer;
		vec[i].iov_len  = 1 << 20;
	}
	int ret = 0;
	while ((--loops) > 0) {
		if (lseek64(fd, offset_sz_bytes, SEEK_SET) < 0) {
			printf("%s: failed on lseek, offset=%lld\n",
			       dev_name, offset_sz_bytes);
			return 0;
		}
		ret = writev(fd, vec, blocks);
		if (ret != write_sz_bytes) {
			perror("failed to write");
			printf("write size=%lld offset=%lld\n",
			       write_sz_bytes, offset_sz_bytes);
			return -1;
		}
		/* pick the next offset: pseudo-random, then folded back into
		 * the device and aligned down to the write size */
		long long rnd = (long long)random();
		offset_sz_bytes = write_sz_bytes * (rnd % diskSizeBytes);
		if (offset_sz_bytes > diskSizeBytes) {
			offset_sz_bytes = (offset_sz_bytes - diskSizeBytes) % diskSizeBytes;
			offset_sz_bytes = (offset_sz_bytes / write_sz_bytes) * write_sz_bytes;
		}
		printf("writing %d bytes at offset %lld\n", ret, offset_sz_bytes);
	}
	return 0;
}
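[Editorial build note, the filename being an assumption: saved as
random_writev.cpp, it should compile with a plain

  g++ -O2 -o random_writev random_writev.cpp

g++ defines _GNU_SOURCE on its own, which O_DIRECT needs on glibc.]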