"Also sprach Neil Brown:" > > The following message was posted on linux-raid earlier today. I > > forward it for comment. > > Odd that I didn't see it, but no-matter. OK - that's the second time that's been said to me. Perhaps I should resend it. I'll try reflecting this cnversation to the list and see if it turns up. > > Here is a preliminary patch for raid1 (2.4) to make it choose the > > fastest device in the array to read from. > > There is obvious value in that.. Yes. > > The standard kernel code changes the read device every 128 sectors, in > > equal-handed style. That's not so good when one device is much slower > > than the other, as is the cse if one is a network device and one is a > > local device. In principle, any algorithm which applies some knowledge > > or intelligence will do better than a 50/50 strategy, provided it's at > > all oriented in the right direction. > > I've never really liked the "switch after 128 sectors" strategy. It > doesn't obviously make sense. > Maybe "switch if this drive has more sectors currently queued than the > other drive, by a difference of at least 128"... Oh, I see - you want to maintain counters of requests in-minus-out for each component device. That's possible, yes. We can see when we submit a buffer head to the component device (with make_request), and we can see when it comes back finished, via its end_io function (we have modified it in order to be able to ack the original when the last i/o finishes). However, the optimum strategy is clearly "always choose the fastest for read", and so anything else is potentially suboptimal, and therefore why not directly measure how fast each disk is? The question is whether the excess number of sectors queued is an direct or indirect indicator of the speed, and I'd say it is indirect, since the kernel might at any particular point simply have chosen to run one device's request function and not yet have run the others. That makes the strategy have a certain probability of misestimating the fastest. > > The patch works by replacing the swapover to the next device=20 > > at 128 sectors by a short blast of testing to get a picture of the > > latency, then a swapover to the fastest device found, which it stays on > > for 1024 sectors instead of 128. Here is what logging shows: > > The "read a bit here, and bit there" approach feels a bit untidy, > though not excessively so. Maybe one should only do it once, at device insertion? Or perhaps only at very sporadic intervals, nowehere near as frequently as every 1000 requests. > What would you think of instead timing writes - which happen across > all drives in parallel. Writes are not necessarily synchronous (indeed, they are usually not). That might well give a false reading for short bursts. > That would give a similar indication. I'm not sure about that. > For read-only loads you could time the super-block update. Though Which happens once. Well, one could run the testing phase just once. But it "feels" safer to run it at regular intervals. Every 0.5MB is perhaps a bit frequent. Every 8MB might be more appropriate, or one could use some other trigger, such as the observed variability in the latency of the currently used device. > maybe for a readonly load you wouldn't need to update the superblock > at all.... One can also issue "double" read requests during testing, one to each device, and discard the slowest (ack the original request on the first received, however) . 
> Or maybe just time every read request and rely on occasional seeks to
> make sure every drive gets timed...

Because in-sequence reads stay on the same device? I disabled the line
that kept them there, because one should not stay on the slowest device
if by mistake one has ended up on it ...

> You say that you cannot meaningfully test in less than 10 sectors, but
> how much precision do you need? I'm not sure a 10% difference should

I'm not sure - I only tested with loopback devices (under a UML layer).
I don't presently have a feel for real device latencies. It is unusual
for loopback devices to measure as more than 0 jiffies latency, but they
do sometimes return 10 jiffies late, which is an infinity-to-one
variability :(.

> really have much consequence. As this is particularly for automatic
> detection of network devices, wouldn't you be expecting more like
> 1000% differences?

I don't know - but I was not happy with the precision in relation to the
loop devices I was testing on. Is a Gigabit or 100BT ethernet to a
cached remote device so much slower (in terms of latency) than a local
disk?

My concern is arithmetic - when calculating rolling averages in integer
arithmetic with a weighting of about 9:1 (i.e. each new sample carries
about one tenth of the total information stored in the rolling average),
there is a huge hysteresis effect at averages near zero that makes it
very hard to move the average off zero, or one, or whatever it is
currently at. That means that the new information is effectively
discarded unless it is at least 10 jiffies different from the average.
If I change the weighting to 4:1, then only about 4 trials are stored in
the rolling average, but the new information only makes a difference if
it is at least 4 jiffies different from the average. I can fix this by
storing the latency in units of 1/10 of a jiffy, or less (so each jiffy
counts as 10), but I did not want to complicate things (the PS below
sketches what I mean).

> I think I would prefer the result of the latency tests to be to
> black-list slow devices for reading, or at least put a heavy weight
> against them, so that e.g. a "slow" device only ever gets half the reads
> of a "fast" device (if the difference is 2:1 or less).

I tried that originally - by setting the swapover point to be
proportional to the inverse of the latency. That way devices with low
latency were read from for a long time. But it looked silly - since the
idea is to choose the fastest, why spend a constant proportion of the
time on the slowest, which is what would happen?

> Comments?

As above! Thanks very much for the feedback.

Peter
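PS: here is a little standalone illustration of the arithmetic problem
I mean, and of the tenths-of-a-jiffy fix. It is only a toy - the exact
update formula and the names are invented for the illustration, not
lifted from the patch:

/* Rolling-average hysteresis in integer arithmetic, and the fix of
 * keeping the average in tenths of a jiffy.
 * Compile with "cc -o avg avg.c" and run.
 */
#include <stdio.h>

#define W 10   /* ~9:1 weighting: each new sample is about 1/10 of the average */

/* average in whole jiffies - with truncation, a larger sample is thrown
 * away unless it exceeds the average by about W jiffies */
static long update_jiffies(long avg, long sample)
{
    return ((W - 1) * avg + sample) / W;
}

/* the same average kept in tenths of a jiffy (sample scaled by 10) */
static long update_tenths(long avg10, long sample)
{
    return ((W - 1) * avg10 + sample * 10) / W;
}

int main(void)
{
    long avg = 5;      /* current estimate: 5 jiffies */
    long avg10 = 50;   /* the same estimate, stored as 50 tenths */

    /* the device now consistently measures 9 jiffies per request */
    for (int i = 0; i < 20; i++) {
        avg = update_jiffies(avg, 9);
        avg10 = update_tenths(avg10, 9);
    }

    /* whole jiffies never move off 5 - the 4-jiffy difference is lost
     * in the truncation; tenths climb to 81, i.e. about 8 jiffies */
    printf("whole jiffies: avg = %ld\n", avg);
    printf("tenths:        avg = %ld (~%ld jiffies)\n",
           avg10, (avg10 + 5) / 10);
    return 0;
}

The whole-jiffy average sits at 5 forever even though every new sample
says 9, because the difference is under 10 jiffies; scaled by ten, the
same arithmetic settles at about 8 jiffies, so information is only lost
below the one-jiffy level.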