Here is a preliminary patch for raid1 (2.4) to make it choose the fastest device in the array to read from. The standard kernel code changes the read device every 128 sectors, in even-handed style. That's not so good when one device is much slower than the other, as is the case when one is a network device and one is a local device. In principle, any algorithm which applies some knowledge or intelligence will do better than a 50/50 strategy, provided it's at all oriented in the right direction.

I'll work through the patch, commenting on it, then I'll repeat the patch, whole, at the end.

The patch works by replacing the swapover to the next device at 128 sectors with a short blast of testing to get a picture of the latency, followed by a swapover to the fastest device found, which it then stays on for 1024 sectors instead of 128. Here is what logging shows:

  mdadm: /dev/md/0 has been started with 2 drives.
  sh-2.03# dd if=/dev/md/0 of=/dev/null
  raid1: disk 0 latency 0 abandoned after 1024 sectors
  raid1: choosing disk 1 latency 0                        <- test
  raid1: disk 1 latency 0 abandoned after 32 sectors       _|
  raid1: choosing disk 0 latency 0                        <- test
  raid1: disk 0 latency 0 abandoned after 32 sectors       _|
  raid1: choosing disk 1 latency 0                        <- run
  raid1: disk 1 latency 0 abandoned after 1024 sectors     _|
  raid1: choosing disk 0 latency 0
  raid1: disk 0 latency 0 abandoned after 32 sectors
  raid1: choosing disk 1 latency 0
  raid1: disk 1 latency 0 abandoned after 32 sectors
  raid1: choosing disk 1 latency 0
  raid1: disk 1 latency 0 abandoned after 1024 sectors
  raid1: choosing disk 0 latency 0
  raid1: disk 0 latency 0 abandoned after 32 sectors
  raid1: choosing disk 1 latency 0
  raid1: disk 1 latency 0 abandoned after 32 sectors
  raid1: choosing disk 1 latency 0
  raid1: disk 1 latency 0 abandoned after 1024 sectors
  raid1: choosing disk 0 latency 0
  raid1: disk 0 latency 0 abandoned after 32 sectors
  raid1: choosing disk 1 latency 0
  raid1: disk 1 latency 0 abandoned after 32 sectors
  raid1: choosing disk 1 latency 0
  raid1: disk 1 latency 2 abandoned after 1024 sectors
  raid1: choosing disk 0 latency 0
  ...

(I was running on loopback devices, so the latencies are not very realistic, but that's not the point.)

--- linux-2.4.25/drivers/md/raid1.c.post-fr1-2.14b,pre-read-balance  Tue Aug 10 21:36:51 2004
+++ linux-2.4.25/drivers/md/raid1.c  Mon Aug 9 19:27:32 2004
@@ -46,7 +46,7 @@
 #define MD_DRIVER
 #define MD_PERSONALITY
-#define MAX_WORK_PER_DISK 128
+#define MAX_WORK_PER_DISK (128 * 8)
 #define NR_RESERVED_BUFS        32

This replaces the swapover after 128 sectors with a swapover after 1024 sectors. I want the testing to be no more than about 1% of the running period, and one can't test meaningfully in less than about 10 sectors, so the run period has to be about 1000 sectors. Back-of-envelope calculation. I should really #define MAX_TEST_PER_DISK 32.

@@ -434,6 +434,31 @@
                         bitmap->clearbits(bitmap, bh->b_rsector >> 1, bh->b_size >> 10);
                 }
         }
+        /* PTB calculate the latency of the read device and update the record */

We're in the end_io routine for a raid1 buffer head. I forget what it's called (:-() and exactly what an r1_bh is, but it's approximately a master controller for a whole splat of other buffer heads that are sent to the component devices in the case of writes. There's only one buffer head in the splat in the case of a read, but the r1_bh contains useful info anyway, such as a pointer back to the original bh that landed on the raid device and caused all the kerfuffle. Here we take the opportunity to examine a field I added to the r1_bh to record when we started the i/o, look at the time now, calculate the difference - that's the latency, folks - and add it into a rolling average that I am recording for the component device.

+        if (uptodate && (r1_bh->cmd == READ || r1_bh->cmd == READA)) {

Only for successful reads, of course! I don't think we can get READAheads at this point, but checking does no harm.

+                unsigned long latency = jiffies - r1_bh->start_jiffies;

That's the latency for this just-completing read i/o.
Since the arithmetic is unsigned, this actually stays correct across a single 32-bit rollover of jiffies; only an i/o outlasting a whole 2^32-jiffy period would be mismeasured. Improvements welcome anyway.

+                kdev_t dev = (&r1_bh->bh_req)->b_dev;

That's the device the buffer head was aimed at. I need to get back from that to the index of the device in the array.

+                int i;
+
+                /* PTB find the mirror component being read */
+                for (i = 0; i < conf->raid_disks; i++) {
+                        if (conf->mirrors[i].dev == dev)
+                                break;
+                }

Was there a better way to find the index of the device in the array of mirror components? I didn't like searching through them all on each buffer head, even though in general the list will be exactly two long.

+                if (i < conf->raid_disks) {

If we found the device ...

+                        if (latency < 120 * HZ && latency >= 0) {

... and the calculated latency is not completely wacko (note that the latency >= 0 test is vacuous for an unsigned long, so only the upper bound actually bites) ...

+                                conf->latency[i] = 9 * conf->latency[i]
+                                        + latency;
+                                conf->latency[i] /= 10;

... roll the latency into the rolling average. I'm using a 90:10 rolloff. I used a parallel array of rolling averages added to the conf structure. It would have been nicer to use the mirror struct, but I didn't want to step out of the raid1 into the md environment (is it situated there? I don't know).

+                        } else {
+                                printk(KERN_ERR "raid1: bad latency %lu jiffies\n",
+                                        latency);
+                        }

Mention if the latency was absurd. This will cause noise in 75 years.

+                } else {
+                        printk(KERN_ERR "raid1: could not find dev %02x:%02x\n",
+                                MAJOR(dev), MINOR(dev));
+                }
+        }

What do we do if the i/o is for a device that is no longer in the array? Yell and scream? We ought to do something. Presumably a disk can be taken out of the array just after i/o for it has been done and just before we get notified about it. Is there a lock to stop removal while requests are in flight? I suspect we should not lock, but just error out here.

         raid1_free_r1bh(r1_bh);
 }

@@ -569,7 +594,7 @@
         * Don't touch anything for sequential reads.
         */

Here we are back in the main routine that deals with incoming requests and launches buffer heads/requests out to the component devices in response. We'll have to choose the right device to read from.
-        if (this_sector == conf->mirrors[new_disk].head_position)
+        if (0 && /* PTB */ this_sector == conf->mirrors[new_disk].head_position)
                 goto rb_out;

         /*

I disabled the line that says "keep on reading from the same device if we are running a linear sequence of requests". Since I upped the swapover period eightfold, it can't make more than about 10% difference either way, and I wanted to see the swapovers. I don't like this line anyway! Who says we will receive linear requests in linear sequence? And what if somebody else asks to read somewhere else meanwhile?

@@ -578,22 +603,66 @@
         * This is for kicking those idling disks so that
         * they would find work near some hotspot.
         */
         if (conf->sect_count >= conf->mirrors[new_disk].sect_limit) {
-                conf->sect_count = 0;

This is the counter for how many sectors we have read without swapping over to read from somewhere else. It ticks along until it triggers the swapover limit (which I set higher, to 1024 instead of 128) and is then reset. I moved the reset further down ...

+
+                PRINTK(KERN_INFO
+                        "raid1: disk %d latency %d abandoned after %d sectors\n",
+                        new_disk,
+                        conf->latency[new_disk],
+                        conf->sect_count);

... because I wanted to print the value at swapover.

+
+                if (conf->last_source < 0) {

This is a field I introduced in order to govern the swapover algorithm. It records the (index of the) last device we read from continuously and deliberately - not simply the last device read from (a field for that already exists) - because sometimes we are reading just in order to test, which is not to be counted as "deliberately". Perhaps "last_worked" would be a better name. I want to distinguish between "working reads" and "testing reads". Anyway, when the field is -1, it indicates "don't know" and means that we should go and find the fastest device already known and use it, setting the field to the index of the device we used.
Later, the next time we come to the swapover point, the field will be nonnegative, which indicates that we should retest the devices in short read bursts, and only afterwards choose the fastest of them.

+                        /* PTB find the fastest already known and use that */
+
+                        int fastest = -1;
+                        unsigned long best_latency = 0x7fffffff;
+                        int i;
+
+                        for (i = 0; i < conf->raid_disks; i++) {
+                                if (conf->mirrors[i].write_only
+                                    || !conf->mirrors[i].operational)
+                                        continue;
+                                if (conf->latency[i] <= best_latency) {
+                                        best_latency = conf->latency[i];
+                                        fastest = i;
+                                }
+                        }

That found the fastest known device. I guess this (infrequent) search is faster than updating the index as we update the rolling latency on completion of each r1_bh, but we could do it that way instead. This way scans two points on average, once every 1024 sectors, which beats doing something extra 512 times in between. I think.

+                        if (fastest >= 0)
+                                new_disk = fastest;

I don't know what happens if we didn't find a fastest! new_disk stays at whatever it already was, which I think is the disk we were using till now (last_used?).

+                        conf->mirrors[new_disk].sect_limit =
+                                MAX_WORK_PER_DISK;

We set the next swapover point to be the normal value, 1024 sectors.

+                        conf->last_source = new_disk;

And we set the control field for the algorithm to something nonnegative - in fact, to ourselves. We need to know who was last used before testing began so we can cycle the tests through all the devices.

+                } else {
+                        /* PTB move on to run a short test on the next disk */

This is the case when we should be running tests to determine latencies, not doing serious committed reading.

 #if defined(CONFIG_SPARC64) && (__GNUC__ == 2) && (__GNUC_MINOR__ == 92)
                /* Work around a compiler bug in egcs-2.92.11 19980921 */
                new_disk = *(volatile int *)&new_disk;
 #endif

Weird!
-                do {
-                        if (new_disk<=0)
-                                new_disk = conf->raid_disks;
-                        new_disk--;
-                        if (new_disk == disk)
-                                break;
-                } while ((conf->mirrors[new_disk].write_only) ||
-                        (!conf->mirrors[new_disk].operational));

That was the old selection algorithm, which just moved us one along in the array of disks (modulo nonworking and not-to-be-read-yet disks). The following reuses it, but only in the testing phase, since we want to test all the disks. The patch has a delete and an add because the indentation has changed by one level - the code is now inside one more set of braces: the if that detects that we're in a testing phase.

+                        do {
+                                if (new_disk<=0)
+                                        new_disk = conf->raid_disks;
+                                new_disk--;
+                                if (new_disk == disk)
+                                        break; /* nothing else available */
+                        } while ((conf->mirrors[new_disk].write_only) ||
+                                (!conf->mirrors[new_disk].operational));

So that is now the "move along and test the next one".

+                        /* PTB if tested all, will need to choose next time */
+                        if (new_disk == conf->last_source) {
+                                conf->last_source = -1;

If we have run around the array once we will hit the "last used deliberately" disk, and should stop testing and start running in a committed fashion next time. So we set the controlling field to -1.

+                                /* PTB don't retest last source at all */
+                                //conf->mirrors[new_disk].sect_limit = 0;

The commented-out line is a possible modification that would reduce testing on the disk we used long-term last time to just 2 sectors. This would reduce the testing time by nearly half in a 2-device array, but I'm not convinced it avoids anything, since we will likely choose the same device again anyway, and using it a bit extra now is good, not bad.

+                        }
+                        /* PTB only a short test run */
+                        conf->mirrors[new_disk].sect_limit = 32;

This is how we make sure the tests are short! We temporarily set the swapover point low for them. This should really be MAX_TEST_PER_DISK. The value is reset to the large value when we come to choose the device for a long run.
+                }
+
+                conf->sect_count = 0;

Here is the delayed reset of the sectors-read counter.

+                PRINTK(KERN_INFO
+                        "raid1: choosing disk %d latency %d\n",
+                        new_disk,
+                        conf->latency[new_disk]);
                 goto rb_out;
         }

@@ -680,6 +749,7 @@
         r1_bh->master_bh = bh;
         r1_bh->mddev = mddev;
         r1_bh->cmd = rw;
+        r1_bh->start_jiffies = jiffies; /* PTB record start time */
         async_data = NULL;
         if (rw == READ) {

Aha .. the above is just after the r1_bh we will use is received from the allocator, and we set the time on it. It's done at the same time as we set the other fields.

--- linux-2.4.25/include/linux/raid/raid1.h.post-fr1-2.14b,pre-read-balance  Tue Aug 10 21:48:31 2004
+++ linux-2.4.25/include/linux/raid/raid1.h  Mon Aug 9 18:53:57 2004
@@ -59,6 +59,10 @@
         md_wait_queue_head_t    wait_done;
         md_wait_queue_head_t    wait_ready;
         md_spinlock_t           segment_lock;
+
+        int latency[MD_SB_DISKS];
+        int last_source; /* PTB disk read from */
+
 };
 typedef struct raid1_private_data raid1_conf_t;

That was the addition of the latency records and the last-read-from-continuously field to the raid1 conf structure.

@@ -92,6 +96,7 @@
         struct buffer_head      *mirror_bh_list;
         struct buffer_head      bh_req;
         struct raid1_bh         *next_r1; /* next for retry or in free list */
+        unsigned long           start_jiffies; /* PTB when i/o started */
 };

 /* bits for raid1_bh.state */
 #define R1BH_Uptodate   1

And the above is the addition of the start time to the r1_bh struct.

Now I'll repeat the patch again, whole. Improvements welcome.

---------------------------------------------------------------------
Peter T. Breuer MA CASM PhD (Ing.), Prof. Ramon & Cajal
Area de Ingenieria Telematica       E-mail: ptb@xxxxxxxxxx
Dpto. Ingenieria                    Tel: +34 (9)1 624 87 81
Universidad Carlos III de Madrid    Fax: +34 (9)1 624 8749/9430
Butarque 15, Leganes/Madrid         URL: http://www.it.uc3m.es/~ptb
E-28911 Spain                       Mob: +34 69 666 7835

--- linux-2.4.25/drivers/md/raid1.c.post-fr1-2.14b,pre-read-balance  Tue Aug 10 21:36:51 2004
+++ linux-2.4.25/drivers/md/raid1.c  Mon Aug 9 19:27:32 2004
@@ -46,7 +46,7 @@
 #define MD_DRIVER
 #define MD_PERSONALITY
-#define MAX_WORK_PER_DISK 128
+#define MAX_WORK_PER_DISK (128 * 8)
 #define NR_RESERVED_BUFS        32

@@ -434,6 +434,31 @@
                         bitmap->clearbits(bitmap, bh->b_rsector >> 1, bh->b_size >> 10);
                 }
         }
+        /* PTB calculate the latency of the read device and update the record */
+        if (uptodate && (r1_bh->cmd == READ || r1_bh->cmd == READA)) {
+                unsigned long latency = jiffies - r1_bh->start_jiffies;
+                kdev_t dev = (&r1_bh->bh_req)->b_dev;
+                int i;
+
+                /* PTB find the mirror component being read */
+                for (i = 0; i < conf->raid_disks; i++) {
+                        if (conf->mirrors[i].dev == dev)
+                                break;
+                }
+                if (i < conf->raid_disks) {
+                        if (latency < 120 * HZ && latency >= 0) {
+                                conf->latency[i] = 9 * conf->latency[i]
+                                        + latency;
+                                conf->latency[i] /= 10;
+                        } else {
+                                printk(KERN_ERR "raid1: bad latency %lu jiffies\n",
+                                        latency);
+                        }
+                } else {
+                        printk(KERN_ERR "raid1: could not find dev %02x:%02x\n",
+                                MAJOR(dev), MINOR(dev));
+                }
+        }
         raid1_free_r1bh(r1_bh);
 }

@@ -569,7 +594,7 @@
         * Don't touch anything for sequential reads.
         */
-        if (this_sector == conf->mirrors[new_disk].head_position)
+        if (0 && /* PTB */ this_sector == conf->mirrors[new_disk].head_position)
                 goto rb_out;

         /*
@@ -578,22 +603,66 @@
         * This is for kicking those idling disks so that
         * they would find work near some hotspot.
         */
         if (conf->sect_count >= conf->mirrors[new_disk].sect_limit) {
-                conf->sect_count = 0;
+
+                PRINTK(KERN_INFO
+                        "raid1: disk %d latency %d abandoned after %d sectors\n",
+                        new_disk,
+                        conf->latency[new_disk],
+                        conf->sect_count);
+
+                if (conf->last_source < 0) {
+                        /* PTB find the fastest already known and use that */
+
+                        int fastest = -1;
+                        unsigned long best_latency = 0x7fffffff;
+                        int i;
+
+                        for (i = 0; i < conf->raid_disks; i++) {
+                                if (conf->mirrors[i].write_only
+                                    || !conf->mirrors[i].operational)
+                                        continue;
+                                if (conf->latency[i] <= best_latency) {
+                                        best_latency = conf->latency[i];
+                                        fastest = i;
+                                }
+                        }
+                        if (fastest >= 0)
+                                new_disk = fastest;
+                        conf->mirrors[new_disk].sect_limit =
+                                MAX_WORK_PER_DISK;
+                        conf->last_source = new_disk;
+                } else {
+                        /* PTB move on to run a short test on the next disk */
 #if defined(CONFIG_SPARC64) && (__GNUC__ == 2) && (__GNUC_MINOR__ == 92)
                /* Work around a compiler bug in egcs-2.92.11 19980921 */
                new_disk = *(volatile int *)&new_disk;
 #endif
-                do {
-                        if (new_disk<=0)
-                                new_disk = conf->raid_disks;
-                        new_disk--;
-                        if (new_disk == disk)
-                                break;
-                } while ((conf->mirrors[new_disk].write_only) ||
-                        (!conf->mirrors[new_disk].operational));
+                        do {
+                                if (new_disk<=0)
+                                        new_disk = conf->raid_disks;
+                                new_disk--;
+                                if (new_disk == disk)
+                                        break; /* nothing else available */
+                        } while ((conf->mirrors[new_disk].write_only) ||
+                                (!conf->mirrors[new_disk].operational));
+                        /* PTB if tested all, will need to choose next time */
+                        if (new_disk == conf->last_source) {
+                                conf->last_source = -1;
+                                /* PTB don't retest last source at all */
+                                //conf->mirrors[new_disk].sect_limit = 0;
+                        }
+                        /* PTB only a short test run */
+                        conf->mirrors[new_disk].sect_limit = 32;
+                }
+
+                conf->sect_count = 0;
+                PRINTK(KERN_INFO
+                        "raid1: choosing disk %d latency %d\n",
+                        new_disk,
+                        conf->latency[new_disk]);
                 goto rb_out;
         }

@@ -680,6 +749,7 @@
         r1_bh->master_bh = bh;
         r1_bh->mddev = mddev;
         r1_bh->cmd = rw;
+        r1_bh->start_jiffies = jiffies; /* PTB record start time */
         async_data = NULL;
         if (rw == READ) {

--- linux-2.4.25/include/linux/raid/raid1.h.post-fr1-2.14b,pre-read-balance  Tue Aug 10 21:48:31 2004
+++ linux-2.4.25/include/linux/raid/raid1.h  Mon Aug 9 18:53:57 2004
@@ -59,6 +59,10 @@
         md_wait_queue_head_t    wait_done;
         md_wait_queue_head_t    wait_ready;
         md_spinlock_t           segment_lock;
+
+        int latency[MD_SB_DISKS];
+        int last_source; /* PTB disk read from */
+
 };
 typedef struct raid1_private_data raid1_conf_t;

@@ -92,6 +96,7 @@
         struct buffer_head      *mirror_bh_list;
         struct buffer_head      bh_req;
         struct raid1_bh         *next_r1; /* next for retry or in free list */
+        unsigned long           start_jiffies; /* PTB when i/o started */
 };

 /* bits for raid1_bh.state */
 #define R1BH_Uptodate   1