On Wed, 2014-01-22 at 18:37 +0000, Chris Mason wrote:
> On Wed, 2014-01-22 at 10:13 -0800, James Bottomley wrote:
> > On Wed, 2014-01-22 at 18:02 +0000, Chris Mason wrote:

[agreement cut because it's boring for the reader]

> > Realistically, if you look at what the I/O schedulers output on a
> > standard (spinning rust) workload, it's mostly large transfers.
> > Obviously these are misaligned at the ends, but we can fix some of that
> > in the scheduler. Particularly if the FS helps us with layout. My
> > instinct tells me that we can fix 99% of this with layout on the FS + io
> > schedulers ... the remaining 1% goes to the drive as needing to do RMW
> > in the device, but the net impact to our throughput shouldn't be that
> > great.
>
> There are a few workloads where the VM and the FS would team up to make
> this fairly miserable
>
> Small files. Delayed allocation fixes a lot of this, but the VM doesn't
> realize that fileA, fileB, fileC, and fileD all need to be written at
> the same time to avoid RMW. Btrfs and MD have set up plugging callbacks
> to accumulate full stripes as much as possible, but it still hurts.
>
> Metadata. These writes are very latency sensitive and we'll gain a lot
> if the FS is explicitly trying to build full sector IOs.

OK, so these two cases I buy ... the question is can we do something
about them today without increasing the block size?

The metadata problem, in particular, might be block independent: we
still have a lot of small chunks to write out at fractured locations.
With a large block size, the FS knows it's been bad and can expect the
rolled up newspaper, but it's not clear what it could do about it.

The small files issue looks like something we should be tackling today
since writing out adjacent files would actually help us get bigger
transfers.

James

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx. For more info on Linux MM,
see: http://www.linux-mm.org/ .