On Thu, Nov 03, 2016 at 11:37:48PM -0600, Robert LeBlanc wrote:
> On Thu, Nov 3, 2016 at 10:01 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> > On Fri, Nov 04 2016, Robert LeBlanc wrote:
> >
> >> This is always triggered for small reads, preventing the reads from
> >> being spread across all available drives. The comments are also
> >> confusing, as it is supposed to apply only to 'far' layouts but
> >> really only applies to 'near' layouts. Since there aren't problems
> >> with 'far' layouts, there shouldn't be a problem for 'near' layouts
> >> either. This change fairly distributes reads across all drives,
> >> where before they only came from the first drive.
> >
> > Why is "fairness" an issue?
> > The current code will use a device if it finds that it is completely
> > idle, i.e. if nr_pending is 0.
> > Why is that ever the wrong thing to do?
>
> The code also looks for a drive that is closest to the requested
> sector, which doesn't get a chance to happen without this patch. The
> way this part of the code is written, as soon as it finds a good disk
> it breaks out of the loop, so it doesn't even look for a better disk.
> In a healthy array with array-disks X and -p nX, this means that the
> first disk gets all the reads for small I/O. Where nY is less than X,
> it may be covered up because the data is naturally striped, but it
> still may be picking a disk that is farther away from the selected
> sector, causing extra head seeks.
>
> > Does your testing show that overall performance is improved? If so,
> > that would certainly be useful.
> > But it isn't clear (to me) that simply spreading the load more
> > "fairly" is a worthy goal.
>
> I'll see if I have some mechanical drives somewhere to test (I've been
> testing four loopback devices on a single NVMe drive, so you don't see
> an improvement). You can see from the fio results I posted [1] that
> before the patch one drive had all the I/O, and after the patch the
> I/O was distributed across all the drives (it doesn't have to be
> exactly even; just not as skewed as before is good enough). I would
> expect results similar to the 'far' tests done here [0]. Based on the
> previous tests I did, when I saw this code it made complete sense to
> me why we had great performance with 'far' and subpar performance
> with 'near'. I'll come back with some results tomorrow.

But in your test, iodepth is 1, so nr_pending is always 0 when we try
to choose a disk. In this case, always dispatching to one disk doesn't
matter. If your test has a high iodepth, the I/O will be distributed to
all disks, as the first disk's nr_pending will not be 0 (a toy
simulation of this is appended at the end of this mail).

That said, the distribution algorithm does have a problem. We should
have different algorithms for SSDs and hard disks, because seek time
isn't a problem for an SSD. I fixed it for raid1, but not raid10. I
think we should do something similar for raid10 (a rough sketch of that
selection rule is also appended below).

Thanks,
Shaohua
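
To make the iodepth point concrete, here is a rough standalone
simulation. This is not the md code, just a toy model with made-up
names; the "first idle member, otherwise least loaded" rule below is a
simplified stand-in for the real read_balance() selection.

/*
 * Toy model, not kernel code: simulate how a "take the first member
 * with nr_pending == 0" rule behaves at different iodepths.
 */
#include <stdio.h>

#define NDISKS   4
#define NREADS   100000
#define MAXDEPTH 64

/* First member with no in-flight I/O wins; otherwise fall back to the
 * least-loaded member (a simplification of the real fallback logic). */
static int pick_disk(const int *pending)
{
	int least = 0;

	for (int i = 0; i < NDISKS; i++) {
		if (pending[i] == 0)
			return i;
		if (pending[i] < pending[least])
			least = i;
	}
	return least;
}

static void simulate(int iodepth)
{
	int pending[NDISKS] = {0};
	int hits[NDISKS] = {0};
	int inflight[MAXDEPTH];		/* which disk owns each outstanding read */
	int head = 0, tail = 0, outstanding = 0;

	for (int n = 0; n < NREADS; n++) {
		/* once the queue is full, complete the oldest read first */
		if (outstanding == iodepth) {
			pending[inflight[head]]--;
			head = (head + 1) % MAXDEPTH;
			outstanding--;
		}
		int d = pick_disk(pending);

		hits[d]++;
		pending[d]++;
		inflight[tail] = d;
		tail = (tail + 1) % MAXDEPTH;
		outstanding++;
	}

	printf("iodepth=%2d:", iodepth);
	for (int i = 0; i < NDISKS; i++)
		printf("  disk%d=%d", i, hits[i]);
	printf("\n");
}

int main(void)
{
	simulate(1);	/* nr_pending is always 0: every read hits disk 0 */
	simulate(8);	/* deeper queue: reads spread across the members  */
	return 0;
}

In this toy model, iodepth=1 puts all 100000 reads on disk 0, while
iodepth=8 spreads them evenly (25000 per member), because the first
disk is no longer idle at selection time.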
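
And for the raid1-style direction mentioned above, the rough shape of
the heuristic (again only an illustrative sketch with invented names
and fields, not the actual drivers/md/raid1.c code) is to track both
the least-loaded member and the member closest to the requested sector,
then prefer pending-count balance for non-rotational devices and seek
distance for spinning disks:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

struct member {
	int  nr_pending;	/* in-flight I/O on this member */
	long head_sector;	/* rough current head position */
	int  nonrot;		/* 1 if the device is non-rotational (SSD) */
};

static int choose_member(const struct member *m, int nmembers, long sector)
{
	int best_pending_disk = 0, best_dist_disk = 0;
	int min_pending = INT_MAX;
	long best_dist = LONG_MAX;
	int has_nonrot = 0;

	for (int i = 0; i < nmembers; i++) {
		long dist = labs(sector - m[i].head_sector);

		has_nonrot |= m[i].nonrot;
		if (m[i].nr_pending < min_pending) {
			min_pending = m[i].nr_pending;
			best_pending_disk = i;
		}
		if (dist < best_dist) {
			best_dist = dist;
			best_dist_disk = i;
		}
	}

	/*
	 * SSDs: seek cost is irrelevant, so balance the queue depth.
	 * Spinning disks: minimise the seek distance instead.
	 */
	return has_nonrot ? best_pending_disk : best_dist_disk;
}

int main(void)
{
	struct member ssd[2] = { { 3, 1000, 1 }, { 1, 900000, 1 } };
	struct member hdd[2] = { { 3, 1000, 0 }, { 1, 900000, 0 } };

	/* SSD pair: picks member 1 (fewer pending I/Os, distance ignored) */
	printf("ssd pick: %d\n", choose_member(ssd, 2, 1008));
	/* HDD pair: picks member 0 (closest head, even though it is busier) */
	printf("hdd pick: %d\n", choose_member(hdd, 2, 1008));
	return 0;
}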