On 29/02/12 5:47 AM, John Adams wrote:
For some years I've been working on some niche filesystems which serve
workflows involving lots of video. Lately, I have had occasion to
investigate the behavior of md as a possible raid solution (2.6.32
kernel).
As part of that, we looked at some fio-based loads in the buffered and
O_DIRECT cases and noticed some read activity that we didn't understand
when using O_DIRECT. We were led to this comparison by incorrect
information from a vendor. (We were trying to reproduce some reported
performance numbers and were initially told that O_DIRECT had been used.)
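(For context, the comparison was between fio invocations of roughly this
shape; the job parameters here are illustrative placeholders, not our
actual job files:)

  # buffered sequential write to the array
  fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=1M --size=8M --ioengine=sync
  # same load, but with O_DIRECT
  fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=1M --size=8M --ioengine=sync --direct=1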
We are aware of the problems discussed concerning O_DIRECT. As fs
guys, we're accustomed to worrying about copies and such, so it wasn't
immediately obvious to us that O_DIRECT would be a mistake in our
case. This is essentially an embedded system with a single process
owning a group of disks with no filesystem. There is no possibility
of a race with another process.
Anyway, I am curious about this read behavior and would be grateful for
any comments.
I tried writing single stripes under both scenarios. To give the
barest possible summary, I used a dd command like the one below, with
oflag=direct either included or omitted. This was driven from a script
that sets up blktrace and ftrace, waits an appropriate time in the
buffered case, and so on (a stripped-down sketch of that script follows
the array details below).
dd oflag=direct if=/dev/zero of=/dev/md0 seek=0 bs=1M count=1
8+2, 128k strip (chunk) size - so one full stripe holds 8 x 128 KiB = 1 MiB
of data, and the 1 MiB write above covers exactly one stripe.
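The driver script, stripped down (device names, delays, and output
handling here are simplified placeholders rather than our exact setup):

  #!/bin/sh
  # assumes debugfs is mounted at /sys/kernel/debug (needed by blktrace)
  MEMBERS="/dev/sdb /dev/sdc /dev/sdd"   # one entry per member disk (example names)

  # start one tracer per component disk
  for d in $MEMBERS; do
      blktrace -d "$d" -o "$(basename "$d")" &
  done
  sleep 1

  # the write under test; drop oflag=direct for the buffered run
  dd oflag=direct if=/dev/zero of=/dev/md0 seek=0 bs=1M count=1

  sleep 30                        # buffered case: give writeback time to finish
  killall -INT blktrace; wait     # SIGINT makes blktrace flush and exit

  # per-disk completion summaries, as quoted below
  for d in $MEMBERS; do
      echo "=== $d ==="
      blkparse -i "$(basename "$d")" | tail -20
  done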
[physical disk completions via blkparse]
Buffered:
Reads Completed: 2, 5KiB Writes Completed: 4, 258KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 3, 130KiB
Reads Completed: 2, 8KiB Writes Completed: 3, 130KiB
Reads Completed: 6, 24KiB Writes Completed: 4, 258KiB
Direct Example 1:
Reads Completed: 2, 5KiB Writes Completed: 20, 258KiB
Reads Completed: 9, 36KiB Writes Completed: 14, 130KiB
Reads Completed: 32, 128KiB Writes Completed: 14, 130KiB
Reads Completed: 1, 4KiB Writes Completed: 16, 130KiB
Reads Completed: 32, 128KiB Writes Completed: 12, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 8, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 8, 130KiB
Reads Completed: 0, 0KiB Writes Completed: 8, 130KiB
Reads Completed: 2, 8KiB Writes Completed: 8, 130KiB
Reads Completed: 6, 24KiB Writes Completed: 19, 258KiB
Direct Example 2:
Reads Completed: 4, 133KiB Writes Completed: 3, 130KiB
Reads Completed: 11, 164KiB Writes Completed: 3, 130KiB
Reads Completed: 34, 256KiB Writes Completed: 3, 130KiB
Reads Completed: 2, 132KiB Writes Completed: 3, 130KiB
Reads Completed: 33, 256KiB Writes Completed: 3, 130KiB
Reads Completed: 3, 136KiB Writes Completed: 3, 130KiB
Reads Completed: 7, 152KiB Writes Completed: 3, 130KiB
I was able to gain a little bit of insight through blktrace and
ftrace. Our initial assumption was that maybe things were being
broken up differently such that md thought it needed to do a
read-modify-write. But as I dug into the blktrace output, that did not
seem to be the case (the reads come after what is obviously the
full-stripe write). I used ftrace to show me the path down to
md_make_request in the O_DIRECT and buffered cases. This showed some
calls referring to readahead in the direct case.
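(The stacks were captured with the function tracer's stack-trace option,
roughly as follows; the exact tracing control files can vary between
kernel versions:)

  cd /sys/kernel/debug/tracing
  echo md_make_request > set_ftrace_filter   # limit tracing to this function
  echo function > current_tracer
  echo 1 > options/func_stack_trace          # record a stack for each hit
  # ... run the dd ...
  echo 0 > options/func_stack_trace
  echo nop > current_tracer
  cat trace

In the direct case, for example: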
<...>-14859 [001] 510340.525310: md_make_request
<...>-14859 [001] 510340.525311:<stack trace>
=> generic_make_request
=> submit_bio
=> submit_bh
=> block_read_full_page
=> blkdev_readpage
=> __do_page_cache_readahead
=> force_page_cache_readahead
=> page_cache_sync_readahead
So is this read ahead I'm observing? Why does it occur only in the
direct case?
I noticed that blktrace sometimes identifies what I assume to be the
instigator of the I/O, so I can sometimes see dd or md0_raid6 there,
as in [dd] or [md0_raid6]:
8,16 1 115 0.042000000 2910 D W 2256 + 48 [md0_raid6]
The unexplained reads instead mention blkid, [0], or [(null)].
It isn't clear to me whether the unexpected read behavior is due to a
tuning problem in the O_DIRECT case or simply the way things work.
Thank you for any comments.
G'day John,
You need to give us more detail about your md RAID setup. Besides a
reference to md_raid6, there are no other details about your array.
How about sending the output of the following (a one-shot way to collect
it all is sketched after the list):
mdadm -V
uname -a
mdadm -Dvv /dev/mdarray
mdadm -Evv /dev/arraycomponentdevices - for all of them
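For example, collected in one go (the device names below are placeholders
- substitute your actual array and member devices):

  { mdadm -V
    uname -a
    mdadm -Dvv /dev/md0
    for d in /dev/sd[b-k]; do mdadm -Evv "$d"; done
  } > md-report.txt 2>&1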
Good luck in the hunt,
J