Andreas Dilger <adilger@xxxxxxxxx> writes:

> On 2012-01-24, at 9:56, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>> On Tue, Jan 24, 2012 at 10:15:04AM -0500, Chris Mason wrote:
>>> https://lkml.org/lkml/2011/12/13/326
>>>
>>> This patch is another example, although for a slightly different
>>> reason. I really have no idea yet what the right answer is in a
>>> generic sense, but you don't need a 512K request to see higher
>>> latencies from merging.
>>
>> That assumes the 512K request is created by merging. We have enough
>> workloads that create large I/O from the get-go, and not splitting
>> them and eventually merging them again would be a big win. E.g. I'm
>> currently looking at a distributed block device which uses internal
>> 4MB chunks, and increasing the maximum request size to that
>> dramatically increases the read performance.
>
> (sorry about the last email, hit send by accident)
>
> I don't think we can have a "one size fits all" policy here. In most
> RAID devices the IO size needs to be at least 1MB, and with newer
> devices 4MB gives better performance.

Right, and there's more to it than just I/O size: there's the access
pattern, and more importantly the workload and its requirements
(latency vs. throughput).

> One of the reasons that Lustre used to hack so much around the VFS
> and VM APIs is exactly to avoid the splitting of read/write requests
> into pages and then depending on the elevator to reconstruct a
> good-sized IO out of them.
>
> Things have gotten better with newer kernels, but there is still a
> ways to go w.r.t. allowing large IO requests to pass unhindered
> through to disk (or at least as far as ensuring that the IO is
> aligned to the underlying disk geometry).

I've been wondering whether it has gotten better, so I decided to run
a few quick tests (a rough sketch of the commands is appended at the
end of this message).

Kernel version: 3.2.0
Storage: HP EVA FC array
I/O scheduler: CFQ
max_sectors_kb: 1024
Test program: dd

ext3:
- buffered writes and buffered O_SYNC writes, all with a 1MB block
  size, show 4KB I/Os passed down to the I/O scheduler
- buffered 1MB reads are a little better, typically in the 128KB-256KB
  range when they hit the I/O scheduler

ext4:
- buffered writes: 512KB I/Os show up at the elevator
- buffered O_SYNC writes: data is again 512KB, journal writes are 4KB
- buffered 1MB reads get down to the scheduler in 128KB chunks

xfs:
- buffered writes: 1MB I/Os show up at the elevator
- buffered O_SYNC writes: 1MB I/Os
- buffered 1MB reads: 128KB chunks show up at the I/O scheduler

So ext4 is doing better than ext3, but still not perfect. xfs is
kicking ass for writes, but reads are still split up.

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
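
P.S. For anyone who wants to reproduce this, something along the lines
of the sketch below should do it. The device and mount point
(/dev/sdb, /mnt/test) are placeholders for whatever is under test, and
blktrace/blkparse is just one way of watching the request sizes that
reach the scheduler.

    # Placeholder names -- adjust for the system under test.
    DEV=/dev/sdb
    MNT=/mnt/test

    # Cap the largest request the block layer will build (1024KB, as above).
    echo 1024 > /sys/block/sdb/queue/max_sectors_kb

    # Trace block-layer events in the background.  In the blkparse output,
    # 'Q' events show the size of each I/O as it enters the block layer and
    # 'D' events show what is dispatched to the driver after any merging.
    # Sizes are in 512-byte sectors, so "+ 256" means a 128KB request.
    blktrace -d $DEV -o - | blkparse -i - > /tmp/blk.log &

    # Buffered 1MB writes.
    dd if=/dev/zero of=$MNT/testfile bs=1M count=1024

    # Buffered O_SYNC 1MB writes.
    dd if=/dev/zero of=$MNT/testfile bs=1M count=1024 oflag=sync

    # Buffered 1MB reads (drop the page cache first so they hit the disk).
    echo 3 > /proc/sys/vm/drop_caches
    dd if=$MNT/testfile of=/dev/null bs=1M

    # Stop the background trace when done.
    kill %1

One thing worth poking at on the read side: buffered reads go through
readahead, and /sys/block/<dev>/queue/read_ahead_kb defaults to 128KB
on most setups, which may well be why all three filesystems hand 128KB
chunks to the scheduler for reads.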