On Mon, Jan 19, 2009 at 01:24:39PM +0100, Peter Rabbitson wrote:
> Keld Jørn Simonsen wrote:
> > Hmm,
> >
> > Why is the command
> >
> > blockdev --setra 65536 /dev/md0
> >
> > really needed? I think the kernel should set a reasonable default here.
>
> The in-kernel default for a block device is 256 (128k) which is way too
> low. The MD subsystem tries to be a bit smarter and assigns the md
> device readahead according to the number of devices/raid level. For
> streaming (i.e. file server) these values are also too low. LVs can take
> a readahead specification at creation time and use that, but this is
> manual.

I would like to have something done automatically in the kernel, so that
you do not need to do it manually. People tend not to know that you need
to add the blockdev statement, e.g. in /etc/rc.local, to get decent
performance. And this is needed even for simpler arrays, such as a
4-drive raid10,f2, which can be set up on many recent motherboards with
SATA-II support directly off the mobo.

> It is arguable what the typical workload is, but I would lean towards
> big long linear reads (fileserver) vs short scattered ones (database).

My understanding is that readahead is only done when the kernel thinks
it is doing sequential reads. This is probably not the case when doing
database operations, so we are kind of safe here, IMHO.

> The real solution to the problem was proposed a long time ago, and it
> seems it got lost in the attic: http://lwn.net/Articles/155510/

Yes, interesting. The patch may not be ready for inclusion for some time
due to complexity and lack of testing. So I am wondering if we could
come up with a formula to set the readahead for raid. It seems like a
big readahead would not affect random reading; it would then only be
overkill for sequential reading of smallish files.

So how does the kernel detect that it is doing sequential reading?
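(As an aside on the blockdev --setra value mentioned at the top: the
argument is a count of 512-byte sectors, so 65536 is a fairly large
readahead. A quick shell check of the arithmetic; the device name in the
comment is just an example:)

```shell
# blockdev --setra takes a count of 512-byte sectors, so 65536 sectors is:
echo "$((65536 * 512 / 1024 / 1024)) MiB readahead"   # prints "32 MiB readahead"

# which one would then apply at boot, e.g. from /etc/rc.local
# (device name is just an example):
# blockdev --setra 65536 /dev/md0
```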
Maybe it detects that the new block to read on a specific file
descriptor is the follower of the previous read on the same FD? And then
we normally read a full chunk of the raid, which is at least something
like 64 KiB; this would take care of most database transactions. I would
think we should then find the smallest readahead value for a given
array, from chunk size and drive count, that gets the array to perform
as expected.

best regards
keld
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html