For some years I've been working on some niche filesystems which serve workflows involving lots of video. Lately I have had occasion to investigate the behavior of md as a possible RAID solution (2.6.32 kernel). As part of that, we looked at some fio-based loads in the buffered and O_DIRECT cases and noticed some reading that we didn't understand when using O_DIRECT. We were led to this comparison by incorrect information from a vendor (we were trying to reproduce some reported performance and were initially told that O_DIRECT had been used). We are aware of the problems discussed concerning O_DIRECT. As fs guys we're accustomed to worrying about copies and such, so it wasn't immediately obvious to us that O_DIRECT would be a mistake in our case: this is essentially an embedded system with a single process owning a group of disks with no filesystem, so there is no possibility of a race with another process. Anyway, I am curious about this reading behavior and I would be grateful for any comments.

I tried writing single stripes under both scenarios. To give the barest possible summary, I used a dd command like this, with oflag=direct either present or omitted:

  dd oflag=direct if=/dev/zero of=/dev/md0 seek=0 bs=1M count=1

This was driven from a script that sets up some blktrace and ftrace things, waits an appropriate time in the buffered case, etc. (sketched below, after the blkparse summaries). The array is 8+2 with a 128k strip. Physical disk completions per member, via blkparse:

Buffered:

  Reads Completed:  2,   5KiB    Writes Completed:  4, 258KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  3, 130KiB
  Reads Completed:  2,   8KiB    Writes Completed:  3, 130KiB
  Reads Completed:  6,  24KiB    Writes Completed:  4, 258KiB

Direct Example 1:

  Reads Completed:  2,   5KiB    Writes Completed: 20, 258KiB
  Reads Completed:  9,  36KiB    Writes Completed: 14, 130KiB
  Reads Completed: 32, 128KiB    Writes Completed: 14, 130KiB
  Reads Completed:  1,   4KiB    Writes Completed: 16, 130KiB
  Reads Completed: 32, 128KiB    Writes Completed: 12, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  8, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  8, 130KiB
  Reads Completed:  0,   0KiB    Writes Completed:  8, 130KiB
  Reads Completed:  2,   8KiB    Writes Completed:  8, 130KiB
  Reads Completed:  6,  24KiB    Writes Completed: 19, 258KiB

Direct Example 2:

  Reads Completed:  4, 133KiB    Writes Completed:  3, 130KiB
  Reads Completed: 11, 164KiB    Writes Completed:  3, 130KiB
  Reads Completed: 34, 256KiB    Writes Completed:  3, 130KiB
  Reads Completed:  2, 132KiB    Writes Completed:  3, 130KiB
  Reads Completed: 33, 256KiB    Writes Completed:  3, 130KiB
  Reads Completed:  3, 136KiB    Writes Completed:  3, 130KiB
  Reads Completed:  7, 152KiB    Writes Completed:  3, 130KiB

I was able to gain a little bit of insight through blktrace and ftrace. Our initial assumption was that maybe things were being broken up differently, such that md thought it needed to do a read-modify-write. But as I dug into the blktrace output, that did not seem to be the case: the reads come after what is obviously the strip write.
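For reference, the driver script was roughly along these lines. This is a simplified sketch rather than the exact script (it omits the ftrace setup, and the member device names /dev/sd[b-k] are just examples):

  #!/bin/sh
  # Trace the member disks, run one stripe-sized write to the array,
  # then summarize per-disk completions with blkparse.
  MEMBERS=$(ls /dev/sd[b-k])      # example names for the 10 members
  MD=/dev/md0

  for d in $MEMBERS; do
      blktrace -d "$d" -o "$(basename $d)" &
  done
  sleep 1

  # The write under test; drop oflag=direct for the buffered run.
  dd oflag=direct if=/dev/zero of=$MD seek=0 bs=1M count=1

  # In the buffered case, wait long enough for writeback to finish.
  sync
  sleep 30

  # Stop the tracers and let them flush their trace files.
  killall -INT blktrace
  wait

  # The "Reads/Writes Completed" lines quoted above come from the
  # blkparse summary for each member.
  for d in $MEMBERS; do
      blkparse -i "$(basename $d)" | grep Completed
  done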
I used ftrace to show me the path down to md_make_request in the O_DIRECT and buffered cases. This showed me some calls referring to readahead in the direct case:

  <...>-14859 [001] 510340.525310: md_make_request
  <...>-14859 [001] 510340.525311: <stack trace>
   => generic_make_request
   => submit_bio
   => submit_bh
   => block_read_full_page
   => blkdev_readpage
   => __do_page_cache_readahead
   => force_page_cache_readahead
   => page_cache_sync_readahead

So is this readahead I'm observing? Why does it occur only in the direct case?

I noticed that blktrace sometimes identifies what I assume to be the instigator of the I/O, so I can sometimes see dd or md0_raid6 there, as in [dd] or [md0_raid6]:

  8,16   1      115     0.042000000  2910  D   W 2256 + 48 [md0_raid6]

The unexplained reads, however, mention blkid, [0], or [(null)].

It isn't clear to me whether the unexpected read behavior is due to a tuning problem in the O_DIRECT case or simply the way things work. Thank you for any comments.
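P.S. In case it is a tuning matter: these are the knobs I have been assuming govern the readahead involved here (md0 and sdb are just example names); corrections welcome if this is the wrong place to look.

  # Readahead on the array device, reported in 512-byte sectors.
  blockdev --getra /dev/md0

  # The same setting as the block layer exposes it, in KiB.
  cat /sys/block/md0/queue/read_ahead_kb

  # And the equivalent on one of the member disks.
  blockdev --getra /dev/sdb
  cat /sys/block/sdb/queue/read_ahead_kb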