Mr. Adams,

RAID 5/6 exports a parameter called "optimal_io_size". You should find
it in /sys/block/mdX/queue/optimal_io_size. This is the size of a
single stripe. In theory, if you write blocks of exactly this size,
aligned on this boundary, to a RAID 5/6 array, the stripe cache should
handle the IO perfectly and you should see zero reads. If you miss the
boundaries, most of the time RAID 5/6 will still gather the writes in
the stripe cache and you will get zero reads. Unfortunately, a small
percentage of the time a read/modify/write gets scheduled in between
two inbound write requests.

To make this somewhat more complicated, there is also a limit on how
large a single request can be. It is limited globally by
"#define BIO_MAX_PAGES 256", or 1MB (as of 3.1.7). With a RAID 5/6
array using 64KB chunks, that lets you have 16 active drives. At
least, so goes the theory. I seem to remember another limit at 1023
sectors, which limits you to 511KB, or 7 active drives.

If you need to drive this from an application, the application has to
hit "optimal_io_size" exactly, both in terms of size and alignment.
You can test this with 'dd'. If you miss the alignment, you will see a
small number of reads. If you want to drive this from user space,
O_DIRECT will work. Ideally, you want multiple outstanding IOs so that
the drives can stream. That implies AIO (which sucked the last time I
tried it), or else you need to hack something inside of kernel space.

As for why RAID 5/6 tends to miss and schedule a read/modify/write at
inopportune times, that appears to be a design trade-off inside the
RAID code. I stared at the code for a long time and never found any
specific timing for how long to wait before scheduling an RMW, so it
looks like you are simply at the mercy of where the clock ticks
happen.

All in all, the RAID 5/6 code is really elegant, but it would be nice
if the kernel in general allowed for longer atomic requests. 1MB (or
512KB, or 511KB, depending on where you look) is just too short for
some "high bandwidth" applications.
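
If it helps, below is a minimal user-space sketch of what I mean by
hitting "optimal_io_size" exactly with O_DIRECT. It assumes the array
is /dev/md0 and just issues a handful of synchronous pwrite() calls,
one per stripe, so it will not keep the drives streaming the way AIO
or multiple threads would. It is only meant to show the sizing and
alignment rules; I have not run this exact code against your kernel.

/* Sketch: write full, aligned stripes to an md array via O_DIRECT.
 * Assumes the array is /dev/md0; error handling is kept minimal. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    unsigned long stripe = 0;

    /* optimal_io_size is the full-stripe size the md driver exports. */
    FILE *f = fopen("/sys/block/md0/queue/optimal_io_size", "r");
    if (!f || fscanf(f, "%lu", &stripe) != 1 || stripe == 0) {
        fprintf(stderr, "could not read optimal_io_size\n");
        return 1;
    }
    fclose(f);

    /* O_DIRECT bypasses the page cache, so buffer, offset, and length
     * all have to be aligned; a full stripe covers the offset and
     * length rules, and posix_memalign takes care of the buffer. */
    int fd = open("/dev/md0", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open /dev/md0");
        return 1;
    }

    void *buf = NULL;
    if (posix_memalign(&buf, 4096, stripe) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, stripe);

    /* Write 16 consecutive full stripes, each starting on a stripe
     * boundary.  Per the discussion above, this should complete with
     * (nearly) zero reads on the member disks. */
    for (int i = 0; i < 16; i++) {
        if (pwrite(fd, buf, stripe, (off_t)i * (off_t)stripe)
                != (ssize_t)stripe) {
            perror("pwrite");
            return 1;
        }
    }

    free(buf);
    close(fd);
    return 0;
}

The dd equivalent is to set bs= to whatever optimal_io_size reports
and add oflag=direct; since seek= is counted in bs-sized blocks, the
writes then stay stripe-aligned, and blkparse on the member disks will
tell you whether you are really getting zero reads.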

Doug Dumitru
EasyCo LLC

On Tue, Feb 28, 2012 at 10:47 AM, John Adams <john.adams@xxxxxxxx> wrote:
>
> For some years I've been working on some niche filesystems which serve
> workflows involving lots of video. Lately, I have had occasion to
> investigate the behavior of md as a possible RAID solution (2.6.32
> kernel).
>
> As part of that, we looked at some fio-based loads in the buffered and
> O_DIRECT cases and noticed some reading that we didn't understand when
> using O_DIRECT. We were led to this comparison by incorrect
> information from a vendor. (We were trying to repro some reported
> performance and were initially told that O_DIRECT had been used.)
>
> We are aware of the problems discussed concerning O_DIRECT. As fs
> guys, we're accustomed to worrying about copies and such, so it wasn't
> immediately obvious to us that O_DIRECT would be a mistake in our
> case. This is essentially an embedded system with a single process
> owning a group of disks with no filesystem. There is no possibility
> of a race with another process.
>
> Anyway, I am curious about this reading behavior and I would be
> grateful for any comments.
>
> I tried writing single stripes under both scenarios. To give the
> barest possible summary, I used a dd command like this, with
> oflag=direct omitted or not. This was driven from a script that
> sets up some blktrace and ftrace things, waits an appropriate time in
> the buffered case, etc.
>
> dd oflag=direct if=/dev/zero of=/dev/md0 seek=0 bs=1M count=1
>
> 8+2 128k strip
>
> [physical disk completions via blkparse]
>
> Buffered:
>
> Reads Completed: 2, 5KiB      Writes Completed: 4, 258KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 3, 130KiB
> Reads Completed: 2, 8KiB      Writes Completed: 3, 130KiB
> Reads Completed: 6, 24KiB     Writes Completed: 4, 258KiB
>
> Direct Example 1:
>
> Reads Completed: 2, 5KiB      Writes Completed: 20, 258KiB
> Reads Completed: 9, 36KiB     Writes Completed: 14, 130KiB
> Reads Completed: 32, 128KiB   Writes Completed: 14, 130KiB
> Reads Completed: 1, 4KiB      Writes Completed: 16, 130KiB
> Reads Completed: 32, 128KiB   Writes Completed: 12, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 8, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 8, 130KiB
> Reads Completed: 0, 0KiB      Writes Completed: 8, 130KiB
> Reads Completed: 2, 8KiB      Writes Completed: 8, 130KiB
> Reads Completed: 6, 24KiB     Writes Completed: 19, 258KiB
>
> Direct Example 2:
>
> Reads Completed: 4, 133KiB    Writes Completed: 3, 130KiB
> Reads Completed: 11, 164KiB   Writes Completed: 3, 130KiB
> Reads Completed: 34, 256KiB   Writes Completed: 3, 130KiB
> Reads Completed: 2, 132KiB    Writes Completed: 3, 130KiB
> Reads Completed: 33, 256KiB   Writes Completed: 3, 130KiB
> Reads Completed: 3, 136KiB    Writes Completed: 3, 130KiB
> Reads Completed: 7, 152KiB    Writes Completed: 3, 130KiB
>
> I was able to gain a little bit of insight through blktrace and
> ftrace. Our initial assumption was that maybe things were being
> broken up differently such that md thought it needed to do an RMW.
>
> But as I dug into the blktrace output, that did not seem to be the
> case (reads are coming after what is obviously the stripe write). I
> used ftrace to show me the path down to md_make_request in the
> O_DIRECT and buffered cases. This showed me some calls referring to
> readahead in the direct case.
>
> <...>-14859 [001] 510340.525310: md_make_request
> <...>-14859 [001] 510340.525311: <stack trace>
> => generic_make_request
> => submit_bio
> => submit_bh
> => block_read_full_page
> => blkdev_readpage
> => __do_page_cache_readahead
> => force_page_cache_readahead
> => page_cache_sync_readahead
>
> So is this readahead I'm observing? Why does it occur only in the
> direct case?
>
> I noticed that blktrace sometimes identifies what I assume to be the
> instigator of the IO, so I can sometimes see dd or md_raid6 there,
> as in [dd] or [md0_raid6]:
>
> 8,16 1 115 0.042000000 2910 D W 2256 + 48 [md0_raid6]
>
> These unexplained reads either mention blkid or [0] or [(null)].
>
> It isn't clear to me whether the unexpected read behavior is due to a
> tuning problem in the O_DIRECT case or simply the way things work.
>
> Thank you for any comments.